Push! versus preallocation

@ChrisRackauckas
I benchmarked a simple loop of 1,000,000 iterations using both push! and preallocation, both using functions and within the global space, and using const versus non-const arrays when executing in the global space. I got some unexpected results, which perhaps somebody can explain.

using BenchmarkTools

function alloc_push()
    a = []
    for i in 1:1000000
        push!(a, i)
    end
end

begin  # requires an indent in Juno
    println("-----------")
    for i in 1:5
        @time alloc_push()
    end
    println("Execution of function alloc_push()")
    println("Cost is about 0.03 sec, with 1M allocs of total 24MBytes")
    println("----------")
end

const bconst = zeros(Int, 1000000)
bnonconst = zeros(Int, 1000000)
function alloc_pre(b)
    for i in 1:1000000
        b[i] = i
    end
end

# No allocations whether b is const or not
begin
    println("-----")
    println("....const")
    for j in 1:5 @time alloc_pre(bconst); end
    println("...nonconst")
    for j in 1:5 @time alloc_pre(bnonconst); end
    println("function call to alloc_pre(): memory is preallocated")
    println("If argument is const, time is 0.002 sec")
    println("If argument is non-const, time is 0.0002 sec")
    println("If I increase array size by 10x, all timings increase by 10x")
    println("WHY IS CONST slower?")
    println("-----")
end

# 2M allocations if `b` is not a const
begin
    println("------");
    println("...const")
    for j in 1:5
        @time for i in 1:1000000 bconst[i] = i; end
    end
    println("...nonconst")
    for j in 1:5
        @time for i in 1:1000000 bnonconst[i] = i; end
    end
    println("Running in global space, 0.002 sec with const array, with no allocations")
    println("    0.04 sec with non-const array, with 2M allocations of total 30.5MBytes with gc")
    println("Execution outside function in global space: ")
    println("    const is 10x faster as expected")
end;

To summarize (also contained within the code):

  • When executing in the global space, with pre-allocated arrays, using const arrays executes without allocations and 10x faster than the non-const version. That is what I expected.
  • When executing the loops within a function with the array as an argument, and this array can be const or non-const, the results are stranger: whether I use a non-const or a const array, I get no memory allocation. Moreover, using a const array runs 10x slower than using a non-const array. Why would that be?

I am running on a Macbook Pro notebook, on the CPU, with 16 GB of RAM, so memory is not the issue. Can anybody duplicate these results? My goal is to gain a better understanding of when memory is allocated, what I can control and what I cannot, and how to use these findings in code where speed is important.

Thanks.

You’re benchmarking a lot of different things here, and I think the particular way you’re setting up the benchmarks is what’s giving the weird results, not any actual difference in performance.

First of all, there’s no need to run @time repeatedly. Instead, use @btime from BenchmarkTools.jl which will automatically run your function multiple times and ensure you are not measuring compile time.

With that, we can see that alloc_pre does not actually show a 10X difference between a const and non-const global variable as input:

julia> using BenchmarkTools

julia> @btime alloc_pre(bconst)
  487.092 μs (0 allocations: 0 bytes)

julia> @btime alloc_pre(bnonconst)
  480.931 μs (0 allocations: 0 bytes)

There’s a small difference here, but it’s likely just noise, and certainly not a 10x difference.
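One way to see why the const-ness of the global stops mattering once it is passed as an argument (a sketch; the helper name is mine, not from the thread): the argument's concrete type is resolved once at the call, so the loop inside the function is type-stable either way.

```julia
# Function barrier: `b`'s concrete type (Vector{Int}) is fixed at the call,
# whether the global binding it came from is const or not.
function fill_range!(b)           # hypothetical helper, equivalent to alloc_pre
    for i in eachindex(b)
        b[i] = i
    end
    return b
end

nonconst_global = zeros(Int, 100)
fill_range!(nonconst_global)      # dispatch happens once, then the loop is fast
@assert nonconst_global == collect(1:100)
```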

Second, your alloc_push benchmark is quite different because it constructs a vector of a different type:

julia> a = []
0-element Array{Any,1}

julia> a = zeros(Int, 1000000)
1000000-element Array{Int64,1}:

[] is an Array{Any, 1}, so it will be slower to work with than an Array{Int64, 1} since the compiler has to treat an Array{Any, 1} as if every element could potentially be of a different type.

You want Vector{Int}() or Int[] to initialize a 1D array of Ints.
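For illustration, here is a typed variant of alloc_push (a sketch; the sizehint! call is an extra optimization of mine, not something required by the advice above):

```julia
# Concretely typed, growable vector instead of the Any-typed `a = []`.
function alloc_push_typed(n)
    a = Int[]                # Vector{Int}, so elements are stored unboxed
    sizehint!(a, n)          # reserve capacity up front to reduce reallocations
    for i in 1:n
        push!(a, i)
    end
    return a
end

v = alloc_push_typed(1_000_000)
@assert eltype(v) == Int && length(v) == 1_000_000
```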

3 Likes

Your points are well taken, and of course, I did check with @btime. However, why is it that @btime returns a single number as opposed to min/max timings? Also, how many timings does @btime execute? Maybe something is wrong with my system. Using the same functions as above (I have not yet implemented your suggested changes):


> @btime alloc_push()
19.086 ms (999509 allocations: 24.25 MiB)

> @btime alloc_push()
17.789 ms (999509 allocations: 24.25 MiB)

> @time alloc_push()
0.038910 seconds (999.51 k allocations: 24.252 MiB, 39.81% gc time)

>

I just noticed the difference in units. But I still have to ask: why aren’t the results of @time valid? I execute my loop 5 times; surely the 2nd execution should produce a time similar to @btime’s? In normal code, execution happens only once (after the code is compiled). Also, @btime would be fine if it took, say, 5x longer than one execution of @time, but that is not the case. What does @btime do that is so expensive computationally?

More precisely, how is this explained?

julia> @time @btime alloc_push()
  17.749 ms (999509 allocations: 24.25 MiB)
 11.136745 seconds (422.11 M allocations: 10.010 GiB, 25.31% gc time)

0.017 sec using @btime, and 11 sec when timing @btime!!! With these kinds of timings, using @btime is not practical. Something must be amiss.

This is all described in detail in the BenchmarkTools.jl manual here: https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md

You can use @benchmark to report min, mean, median, and max time, and to see how many times the function was executed.

1 Like

I think (or at least thought) that’s the most helpful single number.

The primary macro provided by BenchmarkTools is @benchmark […]
For quick sanity checks, one can use the @btime macro, which is a convenience wrapper around @benchmark

help?> @btime
  @btime expression [other parameters...]

[..]
  Unlike @time, it uses the @benchmark macro[..] The printed time is the minimum elapsed time measured

though I’ve seen arguments against reporting the minimum time (which I find the most useful). The median is also useful, but the maximum, which you also get, can be misleading, or can reveal a problem.

I love the documentation! Thank you. It is rather surprising that BenchmarkTools does not provide a standard deviation, which would be very useful as an indicator of system noise. I notice that loop timings can vary by up to a factor of 2x even in a controlled environment. Isn’t that rather high?

I still have not heard an explanation for why @benchmark takes so long. If I have a loop sequence that takes 88 microsec, and I set BenchmarkTools.DEFAULT_PARAMETERS.samples = 10, why is it that @btime takes 6 seconds to run? I understand that there is overhead, but there should be more than ample time to set up whatever structures are necessary. Running @btime twice in a row takes the same amount of time. My guess then is that the time has to do with macro expansion. Any insights?

Surely, one could write a non-macro version that runs @time 10 times and computes the statistics. Of course, we would not get the other features offered by the BenchmarkTools package.
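Such a hand-rolled version might look like the sketch below (the function name is hypothetical; it lacks BenchmarkTools' tuning, warmup handling, and GC accounting):

```julia
# Minimal manual benchmark: run f() n+1 times, discard the first run
# (which includes JIT compilation), and report min/median wall-clock time.
using Statistics

function crude_bench(f; n = 10)
    times = [(@elapsed f()) for _ in 1:(n + 1)]
    popfirst!(times)                 # drop the first run (compile time)
    (min = minimum(times), med = median(times))
end

r = crude_bench(() -> sum(abs2, 1:10_000))
@assert r.min <= r.med
```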

Welcome to the world of turbo boosting and throttling of CPU speed. I think such variation (“2x”) is just to be expected. Timing and predicting the speed of code/optimizations used to be much easier before turbo boost (and before caches); even -O3 on SPEC, long thought to be faster, was at some point shown to be no faster than -O2.

@benchmark and @btime run many “samples”, and each sample consists of many “evaluations” (terminology borrowed from the manual). So you’re still running the function far more than just 10 times, which is why it takes so long.

julia> @benchmark sin(1.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.210 ns (0.00% GC)
  median time:      1.216 ns (0.00% GC)
  mean time:        1.322 ns (0.00% GC)
  maximum time:     24.604 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

10000 samples * 1000 evals/sample = 10000000 evaluations of the function.

You can force the number of evaluations to be smaller:

julia> @benchmark sin(1.0) evals=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.000 ns (0.00% GC)
  median time:      19.000 ns (0.00% GC)
  mean time:        26.298 ns (0.00% GC)
  maximum time:     17.787 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

at the cost of increasing the noise of your measurement.

2 Likes

There’s also “warmup”, accounted for in the BenchmarkTools code.

It’s a can of worms; I haven’t read that 7-page paper on it (most issues are shared by other languages):

Consecutive timing measurements can fluctuate, possibly in a correlated manner, in ways which depend on a myriad of factors such as environment temperature, workload, power availability, and network traffic, and […]
Many factors stem from OS behavior, including CPU frequency scaling [2], address space layout randomization (ASLR) [3], virtual memory management [4], [5], differences between CPU privilege levels [6], context switches due to interrupt handling [7], activity from […]
Authors have also noted the poor [statistical] power of standard techniques such as F-tests or Student t-tests for benchmark timings [10], [15], [17]–[19]. Parametric outlier detection techniques, such as the 3-sigma rule used in benchmarking software like AndroBench [20], can also fail when applied to non-i.i.d. timing measurements. There is a lack of consensus over how non-ideal timing measurements should be treated. […]
To the best of our knowledge, our work is the first benchmarking methodology that can be fully automated, is robust in its assumption of non-i.i.d. timing measurement statistics, and makes efficient use of a limited time budget.

I also link unread, from March 2020 (something I should read myself I guess):
https://dzone.com/articles/introduction-to-benchmarking-in-julia

Yet, comparing the times above, for all statistics pre-allocating the array is slightly worse, even though we’re passing the compiler more knowledge upfront. This didn’t sit well with me, so I consulted the BenchmarkTools.jl manual […]
If you can avoid garbage collection, using six threads here gives nearly a 10x speedup, and at the median where both single-threaded and multi-threaded trigger garbage collection you still get a 2x speedup.

What do you mean by warmup? The compilation? Or something else?
But in any case, if one uses @time 10 times, and throws away the first one, the timings are rather consistent with each other, as I would expect them to be.
I definitely think that selecting the minimum timing returned by @benchmark is not reasonable, although it often makes an algorithm look good. It would be a timing under ideal conditions with no additional system processes going on. That scenario is unlikely when running the program in practice. It also makes a comparison between languages quite difficult. I have quite a bit of experience in the past with several languages.
I still do not see the advantages of @benchmark versus running @time x times, apart from the fact that @benchmark is a macro, part of a package with many options.

I am learning a lot. Thank you for engaging!

I think that’s by design, since it’s not helpful. The stdev and the mean are meaningful for a normal distribution (while “Time distributions are always right-skewed […]”), but the mean, while reported (when you ask for more than one number), is NOT helpful (and then neither is the stdev). See: https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#which-estimator-should-i-use

There’s a lot more to the code I’ve never looked into before:

rmskew!(x::Trial) , rmskew(x::Trial)

Return x (or a copy of x , in the non-mutating case) where samples that positively skew x 's time distribution have been removed. […]

It runs the code many times and does statistics; isn’t that reason enough?

Furthermore, many test functions are much too fast for the resolution of @time to handle. How would you benchmark functions that run in 5 ns with @time?
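For instance (a sketch of mine): a single call that takes a few nanoseconds is far below what one @time measurement can resolve, so the only way to measure it is to amortize the cost over many evaluations, which is exactly what evals/sample does:

```julia
# A ~ns-scale operation must be amortized over many evaluations; one @time of
# a single call would mostly measure timer overhead.
function per_call_ns(evals)
    s = 0.0
    t = @elapsed for i in 1:evals
        s += sin(1.0 * i)    # vary the argument so the call isn't hoisted out
    end
    return t / evals * 1e9, s
end

ns, _ = per_call_ns(1_000_000)   # nanoseconds per sin call, roughly
@assert ns > 0
```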

2 Likes

The tool has no control over “additional” processes, so it wouldn’t be the absolute minimum. But yes, your point is somewhat valid; as I said, I have seen it made elsewhere that the minimum shouldn’t be used (but that was for a very pathological case). I would then look at the median (not the average) instead. Both (and all) estimators tell some story (see the doc I linked). Also, you’re not timing an algorithm, just an implementation of one (yes, that’s what you meant; it’s only a technical point, since algorithms are analyzed by computer scientists with constant factors ignored, while constants are very important for benchmarking).

The timing calls the tune! function, but I couldn’t locate whether the also-exported warmup is called implicitly.

I do see in the docs:

warmup(x::Union{Benchmark, BenchmarkGroup}; verbose = true)

Run a single evaluation of x . This can be useful if you think JIT overhead is being incorporated into your results (which is rarely the case, unless very few samples are taken).

Thank you, @Palli! You make excellent points, as does the documentation. Regarding use of @time: many times, I benchmark something that is already compiled, and which I know takes a few ms. So I simply call @time a few times manually. It is very quick, much quicker than waiting on @btime. But I understand using the minimum and median results, something I could easily, and more quickly, calculate by calling @time 10 times. If a function takes 1 ms, then running it 10,000 times (the default of BenchmarkTools.jl) would take 10 sec! Of course, the number of samples can be changed, and I have done so.

Note that @btime automatically tunes the number of evaluations per sample. This makes sense since there is a lot of fixed-cost variation, like threading overhead.

BenchmarkTools allows you to specify a “time budget” for the benchmarking:

julia> BenchmarkTools.DEFAULT_PARAMETERS.seconds
5.0

julia> @time @benchmark sleep(1e-3)
 11.304928 seconds (1.36 M allocations: 68.068 MiB, 5.41% gc time)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  5
  --------------
  minimum time:     1.061 ms (0.00% GC)
  median time:      2.158 ms (0.00% GC)
  mean time:        2.228 ms (0.00% GC)
  maximum time:     2.488 ms (0.00% GC)
  --------------
  samples:          2229
  evals/sample:     1

julia> BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1.0
1.0

julia> @time @benchmark sleep(1e-3)
  2.709344 seconds (328.24 k allocations: 16.412 MiB, 18.44% gc time)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  5
  --------------
  minimum time:     1.106 ms (0.00% GC)
  median time:      2.123 ms (0.00% GC)
  mean time:        2.130 ms (0.00% GC)
  maximum time:     2.195 ms (0.00% GC)
  --------------
  samples:          467
  evals/sample:     1

This basically allows you to say “I give you x seconds to do benchmarking, now go and give me the most accurate possible runtime estimate”. I think this is a very natural way of formulating a benchmarking problem.

1 Like

And here are my two cents on why I think @btime returns the smallest runtime. (I’m no benchmarking specialist, so maybe I’m wrong, but if so I’m sure some kind soul will point that out :slight_smile:)

I guess there are two primary reasons why people do benchmarking:

  1. Estimate how long your code will run if you repeat some operation many times.
  2. Optimise your code.

It is true that the minimum is probably a bad measure for the first purpose. In a real system, you will inevitably have noise, and you should incorporate that noise into your estimate or your estimate will fall short. For the purpose of code optimisation, however, you want to get a runtime estimate which is as reliable as possible in as short a time as possible, and taking the minimum is the easiest way to do that since any “outside” noise can only increase the runtime. Perhaps the one exception to this rule is garbage collection which is somewhat of a mix between outside noise and actual runtime, but then if you are in the business of performance optimisation you probably make sure that your performance-critical parts are allocation-free anyway.
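That argument can be illustrated with a toy simulation (mine, not from BenchmarkTools): if each measurement is the true runtime plus non-negative interference, the minimum is the estimator least affected by the noise:

```julia
# Toy model: observed time = true time + non-negative system noise.
using Statistics, Random

Random.seed!(1)
true_time = 1.0                          # hypothetical "ideal" runtime (ms)
noise     = abs.(randn(10_000)) .* 0.3   # outside interference only adds time
samples   = true_time .+ noise

@assert minimum(samples) >= true_time                            # noise never makes it faster
@assert minimum(samples) - true_time < mean(samples) - true_time # min has the smallest bias
```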

1 Like

Another case where the minimum can cause issues is stochastic algorithms. If you have unpredictable runtime, the mean is probably the best measure.

1 Like