Interleaving bits (Z-order curve,morton code).Transpose binary matrix

Great solution (and website)!

Some other ideas. First, if you could pack your bits as 4x16 (with 3 bits padding) instead of 4x13, you could probably speed things up a bit.

Second, consider a look-up-table. You can use any number of bits for the table. I’ve chosen 13 below, quite arbitrarily, since it divides 52, but other sizes might perform better depending on how it fits into cache:

const PERM13 = zeros(Int64, 4*2^13)

for p = 0:3, n = 0:2^13-1
    PERM13[n+1 + p*2^13] = uninterleavenibble52(n << (p*13))
end

function permute_lookup(x)
    x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
        PERM13[1 + (x & 0b1111111111111)] |
        PERM13[8193 + ((x >> 13) & 0b1111111111111)] |
        PERM13[16385 + ((x >> 26) & 0b1111111111111)] |
        PERM13[24577 + ((x >> 39) & 0b1111111111111)]
end

In benchmarks this performs great, presumably because it fits nicely in cache, but in practice that might not be the case, so you’d need to test it out.

Third, consider using the pext instruction available on Haswell and newer Intel processors (after 2013) to greatly simplify/speed up the permutation:

@inline pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)

function permute_pext(x)
    x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
        pext(x, 0x1111111111111) |
        pext(x, 0x2222222222222) << 13 |
        pext(x, 0x4444444444444) << 26 |
        pext(x, 0x8888888888888) << 39
end

To figure out if the pext instruction is available, you could do something like this:

CPUInfo = zeros(Int32, 4)
ccall(:jl_cpuidex, Cvoid, (Ptr{Cint}, Cint, Cint), CPUInfo, 7, 0)
has_pext = CPUInfo[2] & 0x100 != 0

Sample timings below. Note that I’ve put @noinline in front of all permute functions to prevent inlining.

julia> using BenchmarkTools

julia> A = rand(UInt64, 100000);

julia> @btime (for n = $A uninterleavenibble52(n) end)
  702.064 μs (0 allocations: 0 bytes)

julia> @btime (for n = $A permute_lookup(n) end)
  371.155 μs (0 allocations: 0 bytes)

julia> @btime (for n = $A permute_pext(n) end)
  200.630 μs (0 allocations: 0 bytes)

This corresponds to ~20.4 cycles for the bit permute implementation, ~10.8 cycles for the look-up-table
and ~5.8 cycles for the pext implementation.