Great solution (and website)!
Some other ideas. First, if you could pack your bits as 4x16 (with 3 bits padding) instead of 4x13, you could probably speed things up a bit.
Second, consider a look-up-table. You can use any number of bits for the table. I’ve chosen 13 below, quite arbitrarily, since it divides 52, but other sizes might perform better depending on how it fits into cache:
const PERM13 = zeros(Int64, 4*2^13)
for p = 0:3, n = 0:2^13-1
PERM13[n+1 + p*2^13] = uninterleavenibble52(n << (p*13))
end
function permute_lookup(x)
x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
PERM13[1 + (x & 0b1111111111111)] |
PERM13[8193 + ((x >> 13) & 0b1111111111111)] |
PERM13[16385 + ((x >> 26) & 0b1111111111111)] |
PERM13[24577 + ((x >> 39) & 0b1111111111111)]
end
In benchmarks this performs great, presumably because it fits nicely in cache, but in practice that might not be the case, so you’d need to test it out.
Third, consider using the pext
instruction available on Haswell and newer Intel processors (after 2013) to greatly simplify/speed up the permutation:
@inline pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)
function permute_pext(x)
x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
pext(x, 0x1111111111111) |
pext(x, 0x2222222222222) << 13 |
pext(x, 0x4444444444444) << 26 |
pext(x, 0x8888888888888) << 39
end
To figure out if the pext
instruction is available, you could do something like this:
CPUInfo = zeros(Int32, 4)
ccall(:jl_cpuidex, Cvoid, (Ptr{Cint}, Cint, Cint), CPUInfo, 7, 0)
has_pext = CPUInfo[2] & 0x100 != 0
Sample timings below. Note that I’ve put @noinline
in front of all permute functions to prevent inlining.
julia> using BenchmarkTools
julia> A = rand(UInt64, 100000);
julia> @btime (for n = $A uninterleavenibble52(n) end)
702.064 μs (0 allocations: 0 bytes)
julia> @btime (for n = $A permute_lookup(n) end)
371.155 μs (0 allocations: 0 bytes)
julia> @btime (for n = $A permute_pext(n) end)
200.630 μs (0 allocations: 0 bytes)
This corresponds to ~20.4 cycles for the bit permute implementation, ~10.8 cycles for the look-up-table
and ~5.8 cycles for the pext implementation.