Help me Improve my poker hand evaluator

bennedich · January 8, 2019, 10:12pm

I have a different view on this than @ffevotte. Benchmarking is tricky to get right, and I do think that you need to take several precautions in order to do it accurately. I’ll break this answer down into two parts. (Note: All benchmarks in this answer are done on a 2.9 GHz Skylake CPU.)

1. "This time is ridiculous something must be wrong"

The reason for this too-fast-to-be-true benchmark:

julia> A = [hand5() for i in 1:100_000];

julia> @btime for n = $A handscore(n) end
  35.836 μs (0 allocations: 0 bytes)

is that you’re not doing anything with the result from handscore, so the compiler simply removes it, and the result becomes an empty loop. You can verify this by doing (trimmed to show inner loop only):

julia> code_native(A -> for n = A handscore(n) end, (Vector{UInt64},); syntax=:intel)
L48:
	add  rcx, 1
	add  rdx, 1
	mov  rsi, qword ptr [rax + 8]
	test rsi, rsi
	js   L70
	cmp  rdx, rsi
	jb   L48
L70:

In order to make sure that your method is called, I find that the safest way is to make it impossible for the compiler to cheat, by doing something with the result and returning it out of the benchmarked expression. Two options are:

julia> A = [hand5() for i in 1:100_000];

julia> @btime (s = UInt64(0); for n = eachindex($A) s += handscore(@inbounds $A[n]) end; s)
  1.057 ms (0 allocations: 0 bytes)
0x01f4838d89c10164

or:

julia> B = similar(A);

julia> @btime broadcast!(handscore, $B, $A);
   1.056 ms (0 allocations: 0 bytes)

As you can see, the time per execution is now ~10.6 ns per call. There’s a slight overhead from the iteration and memory accesses, but not much. You can get an idea of how much (~0.25 ns) by doing this:

julia> @btime broadcast!(identity, $B, $A);
  25.380 μs (0 allocations: 0 bytes)

2. Branch prediction and benchmarking

Now let's look at why this is so fast:

julia> h = hand5()
0x0000060008800004

julia> @btime handscore($h);
  3.925 ns (0 allocations: 0 bytes)

First of all, due to the way your handclass method works, the time to calculate the score varies depending on the hand. Let’s try another hand:

julia> h = hand5()
0x0008020000021001

julia> @btime handscore($h);
  6.039 ns (0 allocations: 0 bytes)

So if you want to test the overall performance of your algorithm, you should obviously test a wide range of hands. (Why not test all hands? There are only binomial(52,5) = 2,598,960 of them.)

This doesn’t fully explain it though, since testing a couple of hands, they all benchmark faster than 7 ns, so why did our previous test show ~10.6 ns? Running the same test with arrays smaller than 100k elements, we can see that the average time goes down:

julia> A = [hand5() for i in 1:1000]; B = similar(A);

julia> @btime broadcast!(handscore, $B, $A);
  8.091 μs (0 allocations: 0 bytes)

julia> A = [hand5() for i in 1:100]; B = similar(A);

julia> @btime broadcast!(handscore, $B, $A);
  491.902 ns (0 allocations: 0 bytes)

For 1000 elements, the average time per call is 8.1 ns. For 100 elements, it’s 4.9 ns. One might think that this is simply a result of caching and increased memory latency as the array grows. However, I don’t think this is an issue for data stored contiguously in memory; as we saw in the identity broadcast experiment above, the latency overhead when looping over 100k elements was only ~0.25 ns.

What I believe to be the explanation for this is a bit more intricate. First note that @btime works by running the same code over and over and over again and reporting the fastest time. Second, note that your code contains quite a bit of branching (if statements) depending on the input data. Third, recall that if a branch is predicted correctly, it is cheap, but a mispredicted branch typically incurs a cost of 5 - 10 ns. Fourth, realize that modern CPUs can remember patterns of several thousand branches and predict them correctly. (Cf. this topic.)

Put together, I believe that learned branch behavior explains the performance discrepancies we’re seeing, and that a more accurate performance estimate of your code is closer to ~10 ns.

Topic		Replies	Views
Fast String processing in Julia vs. Python General Usage	38	6884	March 7, 2017
Recursion base vs hand-made recursion Performance recursion , advent-of-code	13	237	December 29, 2024
Optimizing a puzzle from the Balatro game: discarding part of a poker hand to get a pair Performance probability	14	528	October 30, 2024
Funny Benchmark with Julia (no longer) at the bottom Performance benchmark	149	6054	November 4, 2023
PSA: Microbenchmarks remember branch history Performance benchmark	36	4965	October 1, 2019

Help me Improve my poker hand evaluator

1. "This time is ridiculous something must be wrong"

2. Branch prediction and benchmarking

Related topics