# Interleaving bits (Z-order curve, Morton code). Transpose binary matrix

Hi, I am trying to write the fastest code to un-interleave every 4th bit of a `UInt64`.
So if I have the binary representation of my `Int` as:
`ABCD/EFGH/IJKL/MNOP/QRST/UVWX` (letters are 0 and 1)
I am looking for something like this: `AEIMQU/BFJNRV/CGKOSW/DHLPTX`

It is a sort of binary matrix (4×N) transpose. Is there some built-in function for this?
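For reference, here is a naive bit-by-bit version of the transform described above (`uninterleave4_ref` is my own hypothetical helper name, not a built-in), which a fast implementation could be checked against:

``````julia
# Naive reference for the 4-way un-interleave: bit 4k+p of the input
# moves to bit p*n+k of the output, for an input of n nibbles.
# Hypothetical helper for correctness checks, not a built-in.
function uninterleave4_ref(x::UInt64, n::Int)
    y = zero(UInt64)
    for p in 0:3, k in 0:n-1
        bit = (x >> (4k + p)) & 1
        y |= bit << (p*n + k)
    end
    return y
end
``````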

Looking around I made this function for `UInt32`:

``````# Each step swaps the second and third quartiles of successively
# bigger pieces; the result is that odd bits go right and even bits go left.
function packeveryfour(x::UInt32)
    t = (x ⊻ (x >> 1)) & 0x22222222  # mask 00100010001000100010001000100010
    x = x ⊻ t ⊻ (t << 1)
    t = (x ⊻ (x >> 2)) & 0x0C0C0C0C  # mask 00001100000011000000110000001100
    x = x ⊻ t ⊻ (t << 2)
    t = (x ⊻ (x >> 4)) & 0x00F000F0  # mask 00000000111100000000000011110000
    x = x ⊻ t ⊻ (t << 4)
    t = (x ⊻ (x >> 8)) & 0x0000FF00  # mask 00000000000000001111111100000000
    x = x ⊻ t ⊻ (t << 8)
    t = (x ⊻ (x >> 1)) & 0x22222222  # repeat the process
    x = x ⊻ t ⊻ (t << 1)
    t = (x ⊻ (x >> 2)) & 0x0C0C0C0C
    x = x ⊻ t ⊻ (t << 2)
    t = (x ⊻ (x >> 4)) & 0x00F000F0
    x = x ⊻ t ⊻ (t << 4)
    t = (x ⊻ (x >> 8)) & 0x0000FF00
    x = x ⊻ t ⊻ (t << 8)
    return x
end
``````

This is quite fast, but there are some problems: first, I repeat the same thing twice, and second, there are zeros in the middle of my answer when the input is smaller than a full `UInt32`, since this function will use the empty bits in the left part of the `UInt32`.

In practice I will work on the first 52 bits of a `UInt64`, so I will need to transpose a 13×4 matrix.

To summarize:

• Is there a built-in (or package) function to transpose a binary matrix (stored in an `Int` type)?
• Is there a built-in function to interleave bits (even an LLVM/processor instruction)?
• Do you have any other ideas on how to do this very efficiently?

Thanks


This sounds like a fun question. I’m not in a position to look into it at the moment, but if there’s a way to do it cleverly, @foobar_lv2 on this forum should be able to help you out. Do you just need to interleave a single `UInt64` at a time, though, or are you doing this for multiple values? If multiple values, you may be able to vectorize the code efficiently with AVX.

Yeah, just one `UInt64` at a time: "the first 52 bits of a `UInt64`".

I’d try something like http://programming.sirrida.de/calcperm.php. The web version only supports 32 bits, but maybe you can find a variant for 64 bits.

Example: transpose a 4×8 matrix stored in 32 bits:

``````julia> reshape(0:31, 4, 8)'[:]
32-element Array{Int64,1}:
0
4
8
...
``````

Copy-paste this desired permutation into the source generator:

This is a BPC permutation.

Best method found: BP permutation (about 20 cycles on superscalar processors):

``````x = bit_permute_step(x, 0x00aa00aa, 7);  // Bit index swap 0,3
x = bit_permute_step(x, 0x0000cccc, 14); // Bit index swap 1,4
x = bit_permute_step(x, 0x00f000f0, 4);  // Bit index swap 2,3
x = bit_permute_step(x, 0x0000ff00, 8);  // Bit index swap 3,4
``````

See documentation to bit_permute_step, bit_permute_step_simple, rol.
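In case it helps others try this, the `bit_permute_step` primitive translates directly to Julia; here is a sketch wrapping the four generated steps into a function (`transpose4x8` is a hypothetical name of mine):

``````julia
# Delta swap: exchanges the bit fields selected by mask m with the
# fields `shift` positions above them (the primitive from the site).
function bit_permute_step(x::UInt32, m::UInt32, shift::Int)
    t = ((x >> shift) ⊻ x) & m
    return x ⊻ t ⊻ (t << shift)
end

# The four generated steps combined (hypothetical wrapper name).
function transpose4x8(x::UInt32)
    x = bit_permute_step(x, 0x00aa00aa, 7)   # bit index swap 0,3
    x = bit_permute_step(x, 0x0000cccc, 14)  # bit index swap 1,4
    x = bit_permute_step(x, 0x00f000f0, 4)   # bit index swap 2,3
    x = bit_permute_step(x, 0x0000ff00, 8)   # bit index swap 3,4
    return x
end
``````

Since each step is an involution on the masked fields, the whole thing is a pure bit permutation: bits 0 and 31 stay fixed and the popcount is preserved.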


Wow, amazing site, thx @foobar_lv2.

I downloaded the source code of the calculator at the end of the page ("There is an even better calculator calcperm.*").
I changed the include at the top of calcperm.cpp to use 64 bits before compiling with g++.
And I got this function:

``````function uninterleavenibble52(x)
    t = ((x >> 1) ⊻ x) & 0x0000505050505050
    x = (x ⊻ t) ⊻ (t << 1)
    t = ((x >> 2) ⊻ x) & 0x0000231023102310
    x = (x ⊻ t) ⊻ (t << 2)
    t = ((x >> 4) ⊻ x) & 0x000500000f0f0000
    x = (x ⊻ t) ⊻ (t << 4)
    t = ((x >> 32) ⊻ x) & 0x000000001a084200
    x = (x ⊻ t) ⊻ (t << 32)
    t = ((x >> 8) ⊻ x) & 0x001a002900c100d6
    x = (x ⊻ t) ⊻ (t << 8)
    t = ((x >> 16) ⊻ x) & 0x0000105a0000a552
    x = (x ⊻ t) ⊻ (t << 16)
    t = ((x >> 32) ⊻ x) & 0x000000001a4ab508
    x = (x ⊻ t) ⊻ (t << 32)
    t = ((x >> 4) ⊻ x) & 0x00050d09090b0a0a
    x = (x ⊻ t) ⊻ (t << 4)
    return x
end
``````

Which does exactly what I need.
If someone is up to the task, it would be a good idea to translate this nice permutation generator into a pure Julia package.

Great solution (and website)!

Some other ideas. First, if you could pack your bits as 4×16 (with 3 bits of padding per row) instead of 4×13, you could probably speed things up a bit.

Second, consider a look-up-table. You can use any number of bits for the table. I’ve chosen 13 below, quite arbitrarily, since it divides 52, but other sizes might perform better depending on how it fits into cache:

``````const PERM13 = zeros(Int64, 4*2^13)

for p = 0:3, n = 0:2^13-1
    PERM13[n+1 + p*2^13] = uninterleavenibble52(n << (p*13))
end

function permute_lookup(x)
    x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
        PERM13[1 + (x & 0b1111111111111)] |
        PERM13[8193 + ((x >> 13) & 0b1111111111111)] |
        PERM13[16385 + ((x >> 26) & 0b1111111111111)] |
        PERM13[24577 + ((x >> 39) & 0b1111111111111)]
end
``````

In benchmarks this performs great, presumably because it fits nicely in cache, but in practice that might not be the case, so you’d need to test it out.

Third, consider using the `pext` instruction available on Haswell and newer Intel processors (after 2013) to greatly simplify/speed up the permutation:

``````@inline pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)

function permute_pext(x)
    x & 0xfff0000000000000 | # remove this if you don't need to preserve the top 12 bits
        pext(x, 0x1111111111111) |
        pext(x, 0x2222222222222) << 13 |
        pext(x, 0x4444444444444) << 26 |
        pext(x, 0x8888888888888) << 39
end
``````

To figure out if the `pext` instruction is available, you could do something like this:

``````CPUInfo = zeros(Int32, 4)
ccall(:jl_cpuidex, Cvoid, (Ptr{Cint}, Cint, Cint), CPUInfo, 7, 0)
has_pext = CPUInfo[2] & 0x100 != 0  # BMI2: CPUID leaf 7, EBX (second register), bit 8
``````

Sample timings below. Note that I’ve put `@noinline` in front of all permute functions to prevent inlining.

``````julia> using BenchmarkTools

julia> A = rand(UInt64, 100000);

julia> @btime (for n = $A uninterleavenibble52(n) end)
702.064 μs (0 allocations: 0 bytes)

julia> @btime (for n = $A permute_lookup(n) end)
371.155 μs (0 allocations: 0 bytes)

julia> @btime (for n = $A permute_pext(n) end)
200.630 μs (0 allocations: 0 bytes)
``````

This corresponds to ~20.4 cycles for the bit permute implementation, ~10.8 cycles for the look-up table, and ~5.8 cycles for the pext implementation.
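For reference, those cycle figures follow from the time per element multiplied by the clock frequency; a quick check, assuming a ~2.9 GHz clock (my inference from the numbers, not stated above):

``````julia
# 702.064 μs over 100000 elements ≈ 7.02 ns per element;
# at ~2.9 GHz that is ≈ 20.4 cycles per call.
ns_per_elem = 702.064e3 / 100_000   # total μs → ns per element
cycles = ns_per_elem * 2.9          # GHz × ns = cycles

# Same arithmetic for the pext timing:
cycles_pext = (200.630e3 / 100_000) * 2.9   # ≈ 5.8 cycles
``````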

Thank you for this very detailed answer!

@StefanKarpinski

Is there any way to use LLVM directly in Julia without a `ccall`?

It seems more natural to me, but I am far from understanding LLVM and Julia interactions.

And why can’t we have more direct access to this `pext` instruction from the processor?

PS: I prefer not to use a lookup table.

My results are very different from yours:

``````julia> @btime (for n = $A uninterleavenibble52(n) end)
1.499 ns (0 allocations: 0 bytes)

julia> @btime (for n = $A permute_pext(n) end)
35.799 μs (0 allocations: 0 bytes)

but

julia> @btime uninterleavenibble52(x)
23.594 ns (1 allocation: 16 bytes)
0xffff00072000003f

julia> @btime permute_pext(x)
19.357 ns (1 allocation: 16 bytes)
0xffff00072000003f
``````

I don’t understand.

Benchmarking is tricky to get right. In your first example, 1.5 ns corresponds to only a couple of clock cycles, which is not realistic for looping over thousands of elements. Probably what happens there is the compiler realizes you’re not using the results anyway and never even calls your method. In my timings, I put a `@noinline` annotation in front of each function declaration to prevent this.

The most accurate thing you can do is to time your actual application.
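One way to keep the compiler honest in such micro-benchmarks is to fold the results into an accumulator that the benchmark returns, so the calls cannot be eliminated as dead code. A sketch, with a hypothetical helper name:

``````julia
# XOR-reduce f over A; because the accumulated result is returned
# and observed, the compiler cannot drop the calls to f as dead code.
function xor_reduce(f, A)
    acc = zero(UInt64)
    for n in A
        acc ⊻= f(n)
    end
    return acc
end

# Usage with BenchmarkTools (f is whichever permute function you test):
# @btime xor_reduce(uninterleavenibble52, $A)
``````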

Also, remember to interpolate variables when using `@btime` (the dollar sign in front of the variable). See the difference:

``````julia> @btime uninterleavenibble52($x)
6.370 ns (0 allocations: 0 bytes)
0xffff00072000003f

julia> @btime permute_pext($x)
2.375 ns (0 allocations: 0 bytes)
0xffff00072000003f
``````