Help me Improve my poker hand evaluator

optimization

#1

Hi,
I am making a poker hand evaluator for fun.
the deck of card is represent by the 52 first bits (low value) of a UInt64 like so:
0000/0000/0000/AAAA/KKKK/QQQQ/JJJJ/10,10,10,10/9999/8888/7777/6666/5555/4444/3333/2222
the suits logic is the same for every rank (for example /♣, ♢, ♡, ♠/)

i.e.

0x0008000000008888 # would be a straight flush "A♣,2♣,3♣,4♣,5♣"
0x000000000000f002 # would be a 4 of a kind    "5♣,5♢,5♡,5♠,2♡"
using BenchmarkTools

function hand5() #trigger 5 of the 53 first bits of a UInt64 and return it (generate a 5 cards poker hand)
    a=UInt64(1)<<rand(UInt8(0):UInt8(47))
    for i in UInt8(0):UInt8(3)
        a=a<<UInt8(1) | UInt64(1)
        r=rand(UInt8(0):UInt8(48+i))
        a= ~UInt64(0)>>(UInt8(15)-i) & a<<r | a>>(UInt8(49)+i-r)
    end
    return a
end

@inline pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)#intel pext

@inline function packnibble(x::UInt64) #pack the bits to the right of every nibble in a UInt64
    x -= x>>1 & ~x & 0x7777777777777777 # 0b10  ↦ 0b01
    x -= x>>1 & ~x & 0x7777777777777777
    x -= x>>1 & ~x & 0x7777777777777777
    return x
end

@inline function handclass(x::UInt64,y::UInt64) #return a number for the type of poker hand, x is a raw hand , y is nibble packed x
    x= x >> trailing_zeros(x)
    r= (y-5)%15 # return the type of hand but 0 for straight/flush/highcard
    if r != 0
        return r << 2 #*4 (we need more values in-between to insert the flush/straight)
    elseif x == 0x0000000000011111 #straightflush
        return 0x000000000000002e
    elseif x == 0x0001000000001111 # weak straightflush
        return 0x000000000000002d
    elseif count_ones(pext(x, 0x1111111111111)) == 5 #flush
        return 0x0000000000000013
    elseif y >> trailing_zeros(y) == 0x0000000000011111 #straight
        return 0x0000000000000012
    elseif y == 0x0001000000001111 #weak straight
        return 0x0000000000000011
    else
        return 0x0000000000000000 #high card
    end
end

@inline function handscore(hand)#UInt64 handclass/cards appearing 3times or more/>=2times/>=1time (the bigger the UInt64, the better poker hand)
        y=packnibble(hand)
        handclass(hand,y) << 39 | pext(y, 0x4444444444444) << 26 | pext(y, 0x2222222222222) << 13 | pext(y, 0x1111111111111)
end

A=[hand5() for i in 1:100000000]#100.000.000 hands

@btime B=map(handscore,$A)
1.531 s (3 allocations: 762.94 MiB)

@btime (for n = $A handscore(n) end)
37.881 ms (0 allocations: 0 bytes) (this time is ridiculous something must be wrong)

I put @inline everywhere because I don’t quite understand when to use it or not.

I tried to use @profile but since all the functions are so small it didn’t help much (surely I am doing the wrong way).

I know there is already very efficient way to compare poker hands out there, they are a bit hard to understand for me. https://archive.ph/wM0ER with the files https://github.com/tangentforks/XPokerEval


How to shift bits faster
#2

Are you familiar with https://github.com/StefanKarpinski/Cards.jl
might make your implementation smoother?


#3

The main difference between your two benchmarks is that in the first, you allocate a 1_000_000-element vector to store results, which means that additional memory allocations are needed and perturb benchmarked times. In contrast, the second benchmark discards hand scores as soon as they are computed, which opens the way to many optimizations which might perturb benchmarks as well.

Trying to run the benchmark for different numbers of hands, you would expect benchmarked times to increase linearly with the number of hands scored. In first benchmark, this is not exactly the case, since benchmarked times seem to be perturbed by memory allocations.

import Statistics: mean

let N=100
    while N <= 100_000_000
        A = [hand5() for i in 1:N]
        b = @benchmark handscore.($A)
        a = b.allocs
        t = mean(b.times) / N
        println("$N\t$a\t$t")
        N *= 10
    end
end
# N _hands      allocs  time/hand
100             1       7.094135642424243
1000            1       6.897472639999999
10000           2       12.96812153
100000          2       14.00562684624018
1000000         2       14.042011233146068
10000000        2       14.427071054285713
100000000       2       16.705365596666667

However, assuming that memory allocations are mostly negligible for small numbers of hands, we might conclude that your scoring function takes around 6-8 ns per hand. All this was perhaps not necessary, since BenchmarkTools is smart enough to provide such an estimation by itself, without the additional loop:

julia> @btime handscore($h)
  6.830 ns (0 allocations: 0 bytes)

In conclusion, I would personally put more trust in results provided by a straightforward use of BenchmarkTools, rather than results obtained in a more convoluted way.


#4

@baggepinnen very cool stuff indeed, I might use it but my plan is to try to learn neuronal network and AI so nobody gonna play the cards anyway :wink:

@ffevotte I guess 6-8 ns per hand is fast enough however here is my attempt for

A 7 cards hand

using BenchmarkTools

@inline pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)
@inline pdep(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pdep.64", llvmcall, UInt64, (UInt64, UInt64), x, y)

@inline function packnibble(x::UInt64)
    x -= x>>1 & ~x & 0x7777777777777777
    x -= x>>1 & ~x & 0x7777777777777777
    x -= x>>1 & ~x & 0x7777777777777777
    return x
end

function hand7()
    a=UInt64(1)<<rand(UInt8(0):UInt8(45))
    for i in UInt8(0):UInt8(5)
        a=a<<UInt8(1) | UInt64(1)
        r=rand(UInt8(0):UInt8(46+i))
        a= ~UInt64(0)>>(UInt8(17)-i) & a<<r | a>>(UInt8(47)+i-r)
    end
    return a
end

@inline function handclass(x,y)
    x= x >> trailing_zeros(x)
    r= (y-5)%15
    if r != 0
        return r << 2
    elseif x == 0x0000000000011111
        return 0x000000000000002d
    elseif x == 0x0001000000001111
        return 0x000000000000002e
    elseif count_ones(pext(x, 0x1111111111111)) == 5
        return 0x0000000000000013
    elseif y >> trailing_zeros(y) == 0x0000000000011111
        return 0x0000000000000012
    elseif y == 0x0001000000001111
        return 0x0000000000000011
    else
        return 0x0000000000000000
    end
end

@inline function handscore(hand)
        y=packnibble(hand)
        pext(y, 0x1111111111111) | pext(y, 0x2222222222222) << 13 | pext(y, 0x4444444444444) << 26 | handclass(hand,y) << 39
end

const perm=UInt64[0x000000000000001f,0x000000000000002f,0x0000000000000037,
    0x000000000000003b,0x000000000000003d,0x000000000000003e,
    0x000000000000004f,0x0000000000000057,0x000000000000005b,
    0x000000000000005d,0x000000000000005e,0x0000000000000067,
    0x000000000000006b,0x000000000000006d,0x000000000000006e,
    0x0000000000000073,0x0000000000000075,0x0000000000000076,
    0x0000000000000079,0x000000000000007a,0x000000000000007c]

@inline function besthand(h,perm)
    best=UInt64(0)
    score=UInt64(0)
    for i in 1:21
        score = handscore(pdep(perm[i],h))
        if score > best
            best = score
        end
    end
    return best
end


A=[hand7() for i in 1:1000000]

@btime (for n = A besthand(n,perm) end)
66.625 ms (2999490 allocations: 61.03 MiB)

Is that anyway to get a little speedup?

thank you


#5

I have a different view on this than @ffevotte. Benchmarking is tricky to get right, and I do think that you need to take several precautions in order to do it accurately. I’ll break this answer down into two parts. (Note: All benchmarks in this answer are done on a 2.9 GHz Skylake CPU.)

1. "This time is ridiculous something must be wrong"

The reason for this too-fast-to-be-true benchmark:
julia> A = [hand5() for i in 1:100_000];

julia> @btime for n = $A handscore(n) end
  35.836 μs (0 allocations: 0 bytes)

is that you’re not doing anything with the result from handscore, so the compiler simply removes it, and the result becomes an empty loop. You can verify this by doing (trimmed to show inner loop only):

julia> code_native(A -> for n = A handscore(n) end, (Vector{UInt64},); syntax=:intel)
L48:
	add  rcx, 1
	add  rdx, 1
	mov  rsi, qword ptr [rax + 8]
	test rsi, rsi
	js   L70
	cmp  rdx, rsi
	jb   L48
L70:

In order to make sure that your method is called, I find that the safest way is to make it impossible for the compiler to cheat, by doing something with the result and returning it out of the benchmarked expression. Two options are:

julia> A = [hand5() for i in 1:100_000];

julia> @btime (s = UInt64(0); for n = eachindex($A) s += handscore(@inbounds $A[n]) end; s)
  1.057 ms (0 allocations: 0 bytes)
0x01f4838d89c10164

or:

julia> B = similar(A);

julia> @btime broadcast!(handscore, $B, $A);
   1.056 ms (0 allocations: 0 bytes)

As you can see, the time per execution is now ~10.6 ns per call. There’s a slight overhead from the iteration and memory accesses, but not much. You can get an idea of how much (~0.25 ns) by doing this:

julia> @btime broadcast!(identity, $B, $A);
  25.380 μs (0 allocations: 0 bytes)

2. Branch prediction and benchmarking

Now let's look at why this is so fast:
julia> h = hand5()
0x0000060008800004

julia> @btime handscore($h);
  3.925 ns (0 allocations: 0 bytes)

First of all, due to the way your handclass method works, the time to calculate the score varies depending on the hand. Let’s try another hand:

julia> h = hand5()
0x0008020000021001

julia> @btime handscore($h);
  6.039 ns (0 allocations: 0 bytes)

So if you want to test the overall performance of your algorithm, you should obviously test a wide range of hands. (Why not test all hands? There are only binomial(52,5) = 2,598,960 of them.)

This doesn’t fully explain it though, since testing a couple of hands, they all benchmark faster than 7 ns, so why did our previous test show ~10.6 ns? Running the same test with arrays smaller than 100k elements, we can see that the average time goes down:

julia> A = [hand5() for i in 1:1000]; B = similar(A);

julia> @btime broadcast!(handscore, $B, $A);
  8.091 μs (0 allocations: 0 bytes)

julia> A = [hand5() for i in 1:100]; B = similar(A);

julia> @btime broadcast!(handscore, $B, $A);
  491.902 ns (0 allocations: 0 bytes)

For 1000 elements, the average time per call is 8.1 ns. For 100 elements, it’s 4.9 ns. One might think that this is simply a result of caching and increased memory latency as the array grows. However, I don’t think this is an issue for data stored contiguously in memory; as we saw in the identity broadcast experiment above, the latency overhead when looping over 100k elements was only ~0.25 ns.

What I believe to be the explanation for this is a bit more intricate. First note that @btime works by running the same code over and over and over again and reporting the fastest time. Second, note that your code contains quite a bit of branching (if statements) depending on the input data. Third, recall that if a branch is predicted correctly, it is cheap, but a mispredicted branch typically incurs a cost of 5 - 10 ns. Fourth, realize that modern CPUs can remember patterns of several thousand branches and predict them correctly. (Cf. this topic.)

Put together, I believe that learned branch behavior explains the performance discrepancies we’re seeing, and that a more accurate performance estimate of your code is closer to ~10 ns.


#6

Thank you for your (as always) insightful answer.

Just to be sure I understand: are you saying that somewhere between 100 and 1000 hands, the total number of branches to remember (i.e. the total number of branches in all scores computations) gets too large for the branch predictor to keep in memory? At this point, we start seeing benchmarked times which are more accurate, because they are not artificially “helped” by the hot branch predictor?

I was mislead by the fact that (on my system and with the benchmark I was running), this points happened to coincide with an additional memory allocation. In retrospect, I see I was wrong to attribute this effect to memory allocations: why would an additional allocation double the computing time? In any case, I would never have thought a hot branch predictor to be able to affect computing times so much (other than for some repeated loops with a fixed number of iterations, or other specific --but also very common-- cases). Thanks again!


#7

Yes, exactly. At least on my system (Intel Skylake). You can read more about this in the topic I linked above. On AMD Ryzen for example, it appears that even more branches can be remembered.


#8

I improve my code,

using BenchmarkTools

function hand5() #trigger 5 of the 53 first bits of a UInt64 and return it (a 5 card poker hand)
    a=UInt64(1)<<rand(UInt8(0):UInt8(47))
    for i in UInt8(0):UInt8(3)
        a=a<<UInt8(1) | UInt64(1)
        r=rand(UInt8(0):UInt8(48+i))
        a= ~UInt64(0)>>(UInt8(15)-i) & a<<r | a>>(UInt8(49)+i-r)
    end
    return a
end

@inline pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)

@inline function packnibble(x::UInt64) #pack the bits to the right of every nibble in a UInt64
    x -= x>>1 & ~x & 0x7777777777777777 # 0b10  ↦ 0b01
    x -= x>>1 & ~x & 0x7777777777777777
    x -= x>>1 & ~x & 0x7777777777777777
    return x
end

@inline function handscore(hand::UInt64)#UInt64 handclass/3time/2time/1time (the bigger the UInt64, the better poker hand)
        pckhand = packnibble(hand)
        hand >>= (trailing_zeros(hand)&63)
        handclass = ((pckhand-5)%15)<<41
        pckhand == 0x0001000000001111 && (handclass = 0x0000088000000000)
        pckhand >> (trailing_zeros(pckhand)&63) == 0x0000000000011111 && (handclass = 0x0000090000000000)
        hand == hand & 0x1111111111111 && (handclass = 0x0000098000000000 + handclass<<1)
        handclass | pext(pckhand, 0x4444444444444) << 26 | pext(pckhand, 0x2222222222222) << 13 | pext(pckhand, 0x1111111111111)
end
A=[hand5() for i in 1:100000000]
@btime B=map(handscore,$A)
968.919 ms (3 allocations: 762.94 MiB)

@bennedich, Strangly if I put an if:

@inline function handscoreif(hand::UInt64)#UInt64 handclass/3time/2time/1time (the bigger the UInt64, the better poker hand)
        pckhand = packnibble(hand)
        hand >>= (trailing_zeros(hand)&63)
        handclass = ((pckhand-5)%15)<<41
        if handclass == 0x0000000000000000
            pckhand == 0x0001000000001111 && (handclass = 0x0000088000000000)
            pckhand >> (trailing_zeros(pckhand)&63) == 0x0000000000011111 && (handclass = 0x0000090000000000)
            hand == hand & 0x1111111111111 && (handclass = 0x0000098000000000 + handclass<<1)
        end
        handclass | pext(pckhand, 0x4444444444444) << 26 | pext(pckhand, 0x2222222222222) << 13 | pext(pckhand, 0x1111111111111)
end

It is slower, I would expect the opposite since It should save many operations “most” of the time…
mystery hahaha

A=[hand5() for i in 1:100000000]
@btime B=map(handscoreif,$A)
1.521 s (3 allocations: 762.94 MiB)

can I use @simd or some other tricks to further improve this?

thank you


#9

oops,I didn’t use proper benchmark,

A = [hand5() for i in 1:100_000]
@btime (s = UInt64(0); for n = eachindex($A) s += handscore(@inbounds $A[n]) end; s)
523.000 μs (0 allocations: 0 bytes)

around 200M hands/second it is not bad!

and the if version (strangely slower) :

@btime (s = UInt64(0); for n = eachindex($A) s += handscoreif(@inbounds $A[n]) end; s)
993.099 μs (0 allocations: 0 bytes)

#10

Here is my last version,

@inline function handscore(hand::UInt64)#UInt64 handclass/3time/2time/1time (the bigger the UInt64, the better poker hand)
        pckhand = packnibble(hand)
        pckhand != pckhand & 0x1111111111111 && return ((pckhand-0x0000000000000005)%0x000000000000000f)<<41 | pext(pckhand, 0x4444444444444) << 26 | pext(pckhand, 0x2222222222222) << 13 | pext(pckhand, 0x1111111111111)
        hand >>= (trailing_zeros(hand)&63)
        handclass = pext(pckhand, 0x1111111111111)
        hand == hand & 0x1111111111111 && (handclass = handclass | 0x0000098000000000)
        pckhand >> (trailing_zeros(pckhand)&63) == 0x0000000000011111 && return (0x0000090000000000 + handclass<<1)
        pckhand == 0x0001000000001111 && return (0x0000088000000000 + handclass<<1)
        return handclass
end

However after several benchmarking I conclude that it is ~ 4 ns/hand

But the @btime does seems to be misleading on this one

@bennedich This is very tricky to have proper benchmark indeed!

alright I think it cannot be faster so I am quitting :wink:

Thanks you all for for your answers on this post, and the others!


#11

I coudn’t sleep so I improve it again (the devil got me :sweat_smile:)

I change the previous function packnibble(x::UInt64) by

@inline function sumnibble(x::UInt64)
        x = x - x >> 1 & 0x5555555555555
        x = x & 0x3333333333333 + x >> 2 & 0x3333333333333
end

witch save “8 instructions”

here is my new function:

@inline function test(hand::UInt64)#UInt64 handclass/square/trice/deuce(the bigger the UInt64, the better poker hand)
        sumhand = sumnibble(hand)
        handclass = ((sumhand - (sumhand|sumhand>>1) & 0x1111111111111)%0x000000000000000f) << 41
        hand >>= (trailing_zeros(hand)&63)
        sumhand >> (trailing_zeros(sumhand)&63) == 0x0000000000011111 && (handclass = 0x0000050000000000)
        sumhand == 0x0001000000001111 && (handclass = 0x0000048000000000)
        hand == hand & 0x1111111111111 && (handclass += 0x0000058000000000)
        return handclass | pext(sumhand + sumhand>>1, 0x4444444444444) << 26 | pext(sumhand, 0x2222222222222) << 13 | pext(sumhand, 0x1111111111111)
end
A=[hand5() for i in 1:1000_000]
@btime (s = UInt64(0); for n = eachindex($A) s += test(@inbounds $A[n]) end; s)
4.691 ms (0 allocations: 0 bytes)

And the “optimized” version:

@inline function handscore(hand::UInt64)#UInt64 handclass/3time/2time/1time (the bigger the UInt64, the better poker hand)
        sumhand = sumnibble(hand)
        sumhand != sumhand & 0x1111111111111 && return ((sumhand-(sumhand|sumhand>>1) & 0x1111111111111)%0x000000000000000f) << 41 | pext(sumhand + 0x1111111111111, 0x4444444444444) << 26 | pext(sumhand, 0x2222222222222) << 13 | pext(sumhand, 0x1111111111111)
        hand >>= (trailing_zeros(hand)&63)
        handclass = pext(sumhand, 0x1111111111111)
        hand == hand & 0x1111111111111 && (handclass = handclass | 0x0000058000000000)
        sumhand >> (trailing_zeros(sumhand)&63) == 0x0000000000011111 && return (0x0000050000000000 + handclass)
        sumhand == 0x0001000000001111 && return (0x0000048000000000 + handclass)
        return handclass
end

Witch once again give me strangely bad benchmarks but Is faster in my actual application.
Less than 4ns / hand.


#12

Neat, but is this the bottleneck in your application?