Interleaving bits (Z-order curve,morton code).Transpose binary matrix

bennedich · December 13, 2018, 5:10pm

Great solution (and website)!

Some other ideas. First, if you could pack your bits as 4x16 (with 3 bits padding) instead of 4x13, you could probably speed things up a bit.

Second, consider a look-up-table. You can use any number of bits for the table. I’ve chosen 13 below, quite arbitrarily, since it divides 52, but other sizes might perform better depending on how it fits into cache:

const PERM13 = zeros(Int64, 4*2^13)

for p = 0:3, n = 0:2^13-1
    PERM13[n+1 + p*2^13] = uninterleavenibble52(n << (p*13))
end

function permute_lookup(x)
    x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
        PERM13[1 + (x & 0b1111111111111)] |
        PERM13[8193 + ((x >> 13) & 0b1111111111111)] |
        PERM13[16385 + ((x >> 26) & 0b1111111111111)] |
        PERM13[24577 + ((x >> 39) & 0b1111111111111)]
end

In benchmarks this performs great, presumably because it fits nicely in cache, but in practice that might not be the case, so you’d need to test it out.

Third, consider using the pext instruction available on Haswell and newer Intel processors (after 2013) to greatly simplify/speed up the permutation:

@inline pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)

function permute_pext(x)
    x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
        pext(x, 0x1111111111111) |
        pext(x, 0x2222222222222) << 13 |
        pext(x, 0x4444444444444) << 26 |
        pext(x, 0x8888888888888) << 39
end

To figure out if the pext instruction is available, you could do something like this:

CPUInfo = zeros(Int32, 4)
ccall(:jl_cpuidex, Cvoid, (Ptr{Cint}, Cint, Cint), CPUInfo, 7, 0)
has_pext = CPUInfo[2] & 0x100 != 0

Sample timings below. Note that I’ve put @noinline in front of all permute functions to prevent inlining.

julia> using BenchmarkTools

julia> A = rand(UInt64, 100000);

julia> @btime (for n = $A uninterleavenibble52(n) end)
  702.064 μs (0 allocations: 0 bytes)

julia> @btime (for n = $A permute_lookup(n) end)
  371.155 μs (0 allocations: 0 bytes)

julia> @btime (for n = $A permute_pext(n) end)
  200.630 μs (0 allocations: 0 bytes)

This corresponds to ~20.4 cycles for the bit permute implementation, ~10.8 cycles for the look-up-table
and ~5.8 cycles for the pext implementation.

Topic		Replies	Views
Deinterleave data Performance arrays	12	956	November 25, 2021
Shifting the bit in the array General Usage	29	1020	June 22, 2023
What is the best way to regroup the ones to the most right of every nibbles in a UInt? General Usage bit-twiddling	7	863	January 6, 2019
Multiplication of `BitMatrix`s for linear algebra modulo 2 Performance linearalgebra	10	590	February 29, 2024
Optimization: How to make sure XOR is performed in chunks New to Julia	33	2335	July 6, 2022

Interleaving bits (Z-order curve,morton code).Transpose binary matrix

Related topics