I have a different view on this than @ffevotte. Benchmarking is tricky to get right, and I do think that you need to take several precautions in order to do it accurately. I’ll break this answer down into two parts. (Note: All benchmarks in this answer are done on a 2.9 GHz Skylake CPU.)
1. "This time is ridiculous something must be wrong"
The reason for this too-fast-to-be-true benchmark:

julia> A = [hand5() for i in 1:100_000];
julia> @btime for n = $A handscore(n) end
35.836 μs (0 allocations: 0 bytes)
is that you’re not doing anything with the result from handscore, so the compiler simply removes the call and leaves an empty loop. You can verify this by inspecting the generated code (trimmed to show the inner loop only); note that the loop body below only increments counters and performs the array-bounds check, and never calls handscore:
julia> code_native(A -> for n = A handscore(n) end, (Vector{UInt64},); syntax=:intel)
L48:
add rcx, 1
add rdx, 1
mov rsi, qword ptr [rax + 8]
test rsi, rsi
js L70
cmp rdx, rsi
jb L48
L70:
To make sure that your method is actually called, I find that the safest way is to make it impossible for the compiler to cheat: do something with the result and return it out of the benchmarked expression. Two options are:
julia> A = [hand5() for i in 1:100_000];
julia> @btime (s = UInt64(0); for n = eachindex($A) s += handscore(@inbounds $A[n]) end; s)
1.057 ms (0 allocations: 0 bytes)
0x01f4838d89c10164
or:
julia> B = similar(A);
julia> @btime broadcast!(handscore, $B, $A);
1.056 ms (0 allocations: 0 bytes)
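A third option along the same lines (my addition, not from your code; it works here because the scores are plain UInt64 values, so folding them into a single sum keeps every call’s result live):

julia> @btime sum(handscore, $A);

Since sum(f, A) accumulates f applied to every element and the total is returned, the compiler can’t discard any of the calls.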
As the two timings above show, the time per execution is now 1.057 ms / 100,000 ≈ 10.6 ns per call. There’s a slight overhead from the iteration and memory accesses, but not much. You can get an idea of how much (~0.25 ns per element) by replacing handscore with identity:
julia> @btime broadcast!(identity, $B, $A);
25.380 μs (0 allocations: 0 bytes)
2. Branch prediction and benchmarking
Now let’s look at why this is so fast:

julia> h = hand5()
0x0000060008800004
julia> @btime handscore($h);
3.925 ns (0 allocations: 0 bytes)
First of all, due to the way your handclass method works, the time to calculate the score varies depending on the hand. Let’s try another hand:
julia> h = hand5()
0x0008020000021001
julia> @btime handscore($h);
6.039 ns (0 allocations: 0 bytes)
So if you want to test the overall performance of your algorithm, you should obviously test a wide range of hands. (Why not test all of them? There are only binomial(52,5) = 2,598,960 of them.)
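For completeness, here’s a sketch of how all hands could be enumerated, assuming (as the printed values above suggest) that hand5() encodes a hand as a UInt64 bitmask with one bit per card:

julia> using Combinatorics  # provides combinations

julia> allhands = [mapreduce(c -> UInt64(1) << (c - 1), |, cards)
                   for cards in combinations(1:52, 5)];  # set one bit per chosen card

julia> length(allhands) == binomial(52, 5)
true

julia> B = similar(allhands);

julia> @btime broadcast!(handscore, $B, $allhands);

If the bitmask assumption doesn’t match your encoding, the comprehension needs adjusting, but the combinations iterator stays the same.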
This doesn’t fully explain it, though: testing a handful of individual hands, they all benchmark faster than 7 ns, so why did our previous test show ~10.6 ns per call? Running the same test with arrays smaller than 100k elements, we can see that the average time goes down:
julia> A = [hand5() for i in 1:1000]; B = similar(A);
julia> @btime broadcast!(handscore, $B, $A);
8.091 μs (0 allocations: 0 bytes)
julia> A = [hand5() for i in 1:100]; B = similar(A);
julia> @btime broadcast!(handscore, $B, $A);
491.902 ns (0 allocations: 0 bytes)
For 1000 elements, the average time per call is 8.1 ns. For 100 elements, it’s 4.9 ns. One might think that this is simply a result of caching and increased memory latency as the array grows. However, I don’t think this is an issue for data stored contiguously in memory; as we saw in the identity broadcast experiment above, the latency overhead when looping over 100k elements was only ~0.25 ns.
What I believe to be the explanation is a bit more intricate. First, note that @btime works by running the same code over and over again and reporting the fastest time. Second, note that your code contains quite a bit of branching (if statements) that depends on the input data. Third, recall that a correctly predicted branch is cheap, while a mispredicted branch typically incurs a cost of 5–10 ns. Fourth, realize that modern CPUs can remember patterns of several thousand branches and predict them correctly. (Cf. this topic.)
Putting this together, I believe that learned branch behavior explains the performance discrepancies we’re seeing, and that a more accurate performance estimate for your code is closer to ~10 ns per hand.
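One way to test this hypothesis is to regenerate the input between timings, so the predictor never sees the same pattern twice. A sketch using the setup and evals keywords from BenchmarkTools (evals=1 ensures every evaluation gets fresh data; note that the setup variable A must not be interpolated):

julia> B = Vector{UInt64}(undef, 100);

julia> @btime broadcast!(handscore, $B, A) setup=(A = [hand5() for i in 1:100]) evals=1;

If learned branches are indeed the cause, this should come in well above the ~4.9 ns/call we measured on a fixed 100-element array, and closer to the large-array figure.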