I benchmarked a simple loop of 1,000,000 iterations using both push! and preallocation, both inside functions and in global scope, and using const versus non-const arrays when executing in global scope. I got some unexpected results, which perhaps somebody can explain.
function alloc_push()
    a = Int[]
    for i in 1:1000000
        push!(a, i)
    end
    return a
end

function alloc_pre(b)
    for i in 1:1000000
        b[i] = i
    end
    return b
end

const bconst = zeros(Int, 1000000)
bnonconst = zeros(Int, 1000000)

println("-----------")
println("Execution of function alloc_push()")
for j in 1:5 @time alloc_push(); end
println("Cost is about 0.03 sec, with 1M allocs of total 24 MBytes")

println("Function call to alloc_pre(): memory is preallocated")
# No allocations whether b is const or not
for j in 1:5 @time alloc_pre(bconst); end
for j in 1:5 @time alloc_pre(bnonconst); end
println("If argument is const, time is 0.002 sec")
println("If argument is non-const, time is 0.0002 sec")
println("If I increase array size by 10x, all timings increase by 10x")
println("WHY IS CONST slower?")

println("Execution outside function in global scope:")
# 2M allocations if `b` is not a const
for j in 1:5
    @time for i in 1:1000000 bconst[i] = i; end
end
for j in 1:5
    @time for i in 1:1000000 bnonconst[i] = i; end
end
println("Running in global scope, 0.002 sec with const array, with no allocations")
println("  0.04 sec with non-const array, with 2M allocations of total 30.5 MBytes with gc")
println("  const is 10x faster as expected")
To summarize (also contained within the code):
When executing in global scope with preallocated arrays, the const array runs with no allocations and 10x faster than the non-const version. That is what I expected.
When executing the loops within a function that takes the array as an argument (the array itself being const or non-const), the results are stranger: I get no memory allocations either way, yet the const array runs 10x slower than the non-const one. Why would that be?
I am running on a MacBook Pro, on the CPU, with 16 GB of RAM, so memory is not the issue. Can anybody duplicate these results? My goal is to gain a better understanding of when memory is allocated, what I can and cannot control, and how to use these findings in code where speed is important.
You’re benchmarking a lot of different things here, and I think the particular way you’re setting up the benchmarks is what’s giving the weird results, not any actual difference in performance.
First of all, there’s no need to run @time repeatedly. Instead, use @btime from BenchmarkTools.jl which will automatically run your function multiple times and ensure you are not measuring compile time.
With that, we can see that alloc_pre does not actually show a 10X difference between a const and non-const global variable as input:
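A comparison along these lines can be sketched as follows (a minimal sketch, assuming BenchmarkTools is installed; it re-defines `alloc_pre` and the two globals so it is self-contained). Note the `$` interpolation of the globals, which BenchmarkTools recommends so that the benchmarked expression sees a concretely-typed local value rather than an untyped global:

```julia
using BenchmarkTools

const bconst = zeros(Int, 1_000_000)
bnonconst = zeros(Int, 1_000_000)

function alloc_pre(b)
    for i in eachindex(b)
        b[i] = i
    end
    return b
end

# With `$` interpolation, const vs. non-const should make no
# measurable difference: inside the function, the compiler
# specializes on the concrete Vector{Int} type either way.
@btime alloc_pre($bconst)
@btime alloc_pre($bnonconst)
```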
Your points are well taken, and of course I did check with @btime. However, why does @btime return a single number rather than min/max timings? Also, how many timings does @btime execute? Maybe something is wrong with my system. Using the same functions as above (I have not yet implemented your suggested changes):
I just noticed the difference in units. But I still have to ask: why aren’t the results of @time valid? I execute my loop 5 times; surely the 2nd execution should produce a time similar to @btime? In normal code, execution happens only once (once the code is compiled). Also, @btime would be fine if it were, say, 5x slower than one execution of @time. But that is not the case. What does @btime do that is so expensive computationally?
More precisely, how is this explained?
julia> @time @btime alloc_push()
17.749 ms (999509 allocations: 24.25 MiB)
11.136745 seconds (422.11 M allocations: 10.010 GiB, 25.31% gc time)
0.017 sec using @btime, and 11 sec when timing @btime!!! With these kinds of timings, using @btime is not practical. Something must be amiss.
I love the documentation! Thank you. It is rather surprising that BenchmarkTools does not provide a standard deviation, which would be very useful as an indicator of system noise. I notice that loop timings can vary by up to a factor of 2x under a controlled environment. Isn’t that rather high?
I still have not heard an explanation for why @benchmark takes so long. If I have a loop sequence that takes 88 microsec, and I set
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10, why is it that @btime takes 6 seconds to run? I understand that there is overhead, but there should be more than ample time to set up whatever structures are necessary. Running @btime twice in a row takes the same amount of time. My guess, then, is that the time has to do with the macro expansion time. Any insights?
Surely, one could write a non-macro version that runs @time 10 times, and computes the statistics. Of course, we would not get the other features offered by the BenchmarkTools package.
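Such a non-macro version is indeed easy to sketch using only the standard library (`simple_bench` is a hypothetical name; it does none of the tuning, interpolation, or per-sample evaluation batching that BenchmarkTools performs):

```julia
using Statistics  # for median

# Time a zero-argument function `f` with @elapsed `n` times and
# report simple statistics. The initial untimed call forces
# compilation, mirroring the "throw away the first @time" practice.
function simple_bench(f; n::Int = 10)
    f()                                   # warm up: compile first
    times = [@elapsed f() for _ in 1:n]   # wall-clock seconds per call
    return (minimum = minimum(times),
            median  = median(times),
            maximum = maximum(times))
end

stats = simple_bench(() -> sum(rand(1000)); n = 10)
println(stats)
```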
Welcome to the world of turbo boosting and throttling of CPU speed. I think such fluctuations (“2x”) are just to be expected. Timing and predicting the speed of code and optimizations used to be much easier before turbo boost and caches (at one point, -O3 on SPEC, long thought to be faster, was shown to be no faster than -O2).
@benchmark and @btime run many “samples”, and each sample consists of many “evaluations” (terminology borrowed from the manual). So you’re still running the function far more than just 10 times, which is why it takes so long.
There’s also a “warmup” accounted for in the BenchmarkTools code.
It’s a can of worms; I didn’t read that 7-page-long paper on it (most issues are shared by other languages):
Consecutive timing measurements can fluctuate, possibly in a correlated manner, in ways which depend on a myriad of factors such as environment temperature, workload, power availability, and network traffic, and […]
Many factors stem from OS behavior, including CPU frequency scaling, address space layout randomization (ASLR), virtual memory management, differences between CPU privilege levels, context switches due to interrupt handling, activity from […]
Authors have also noted the poor [statistical] power of standard techniques such as F-tests or Student t-tests for benchmark timings. Parametric outlier detection techniques, such as the 3-sigma rule used in benchmarking software like AndroBench, can also fail when applied to non-i.i.d. timing measurements. There is a lack of consensus over how non-ideal timing measurements should be treated. […]
To the best of our knowledge, our work is the first benchmarking methodology that can be fully automated, is robust in its assumption of non-i.i.d. timing measurement statistics, and makes efficient use of a limited time budget.
Yet, comparing the times above, for all statistics pre-allocating the array is slightly worse, even though we’re passing the compiler more knowledge upfront. This didn’t sit well with me, so I consulted the BenchmarkTools.jl manual […]
If you can avoid garbage collection, using six threads here gives nearly a 10x speedup, and at the median where both single-threaded and multi-threaded trigger garbage collection you still get a 2x speedup.
What do you mean by warmup? The compilation? Or something else?
But in any case, if one uses @time 10 times, and throws away the first one, the timings are rather consistent with each other, as I would expect them to be.
I definitely think that selecting the minimum timing returned by @benchmark is not reasonable, although it often makes an algorithm look good. It would be a timing under ideal conditions with no additional system processes going on. That scenario is unlikely when running the program in practice. It also makes a comparison between languages quite difficult. I have quite a bit of experience in the past with several languages.
I still do not see the advantages of @benchmark versus running @time x times, apart from the fact that @benchmark is a macro, part of a package with many options.
The tool has no control over “additional” processes, so it wouldn’t be the absolute minimum. But yes, the point you’re making is somewhat valid; as I said, I have seen it made elsewhere that the minimum shouldn’t be used (though that was for a rather pathological case). I would then look at the median (not the average) instead. Both (and all) estimators tell some story (see the doc I linked). Also, you’re not timing an algorithm, just an implementation of one (yes, that’s only a technical point: algorithms are analyzed by computer scientists with constant factors ignored, while for benchmarking the constant factor is very important).
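Both estimators are directly available from the trial object that @benchmark returns (a minimal sketch, assuming BenchmarkTools is installed):

```julia
using BenchmarkTools

t = @benchmark sum(rand(1000))

# minimum: best case, least affected by outside noise, since
# interference can only make a run slower.
println(minimum(t))

# median: more representative of a typical run under real
# conditions, and robust to one-sided outliers (unlike the mean).
println(median(t))
```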
The timing calls the tune! function, but I didn’t locate whether the also-exported warmup is called implicitly:
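For reference, the macro form expands to roughly this lower-level sequence, which can also be called explicitly (a sketch, assuming BenchmarkTools is installed; whether tune! itself triggers the warmup is exactly the open question here, so it is shown as a separate step):

```julia
using BenchmarkTools

b = @benchmarkable sum(rand(1000))   # build a benchmark without running it
warmup(b)                            # one untimed evaluation, to exclude compilation
tune!(b)                             # choose how many evaluations per sample
results = run(b)                     # collect the samples
println(minimum(results))
```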
Thank you, @Palli! You make excellent points, as does the documentation. Regarding use of @time: many times I benchmark something that is already compiled and that I know takes a few ms, so I simply call @time a few times manually. It is very quick, much quicker than waiting on @btime. But I understand the value of the minimum and median results, which I could easily (and faster) calculate myself by calling @time 10 times. If a function takes 1 ms, then running it 10,000 times (the default of BenchmarkTools.jl) would take 10 sec! Of course, the number of samples can be changed, and I have done so.
BenchmarkTools allows you to specify a “time budget” for the benchmarking:
julia> @time @benchmark sleep(1e-3)
11.304928 seconds (1.36 M allocations: 68.068 MiB, 5.41% gc time)
memory estimate: 128 bytes
allocs estimate: 5
minimum time: 1.061 ms (0.00% GC)
median time: 2.158 ms (0.00% GC)
mean time: 2.228 ms (0.00% GC)
maximum time: 2.488 ms (0.00% GC)
julia> BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1.0
julia> @time @benchmark sleep(1e-3)
2.709344 seconds (328.24 k allocations: 16.412 MiB, 18.44% gc time)
memory estimate: 128 bytes
allocs estimate: 5
minimum time: 1.106 ms (0.00% GC)
median time: 2.123 ms (0.00% GC)
mean time: 2.130 ms (0.00% GC)
maximum time: 2.195 ms (0.00% GC)
This basically allows you to say “I give you x seconds to do benchmarking, now go and give me the most accurate possible runtime estimate”. I think this is a very natural way of formulating a benchmarking problem.
And here are my two cents on why I think @btime returns the smallest runtime. (I’m no benchmarking specialist, so maybe I’m wrong, but if so I’m sure some kind soul will point that out.)
I guess there are two primary reasons why people do benchmarking:
Estimate how long your code will run if you repeat some operation many times.
Optimise your code.
It is true that the minimum is probably a bad measure for the first purpose. In a real system, you will inevitably have noise, and you should incorporate that noise into your estimate or your estimate will fall short. For the purpose of code optimisation, however, you want to get a runtime estimate which is as reliable as possible in as short a time as possible, and taking the minimum is the easiest way to do that since any “outside” noise can only increase the runtime. Perhaps the one exception to this rule is garbage collection which is somewhat of a mix between outside noise and actual runtime, but then if you are in the business of performance optimisation you probably make sure that your performance-critical parts are allocation-free anyway.