Hi I am trying to make the fastest code to UN-interleaving every 4 bits of a `UInt64`

.

So if I have the binary representation of my `Int`

as :

`ABCD/EFGH/IJKL/MNOP/QRST/UVWX`

(letters are 0 and 1)

I am looking for something like this `AEIMQU/BFJNRV/CGKOSW/DHLPTX`

It is a sort of binary matrix (4xN) transpose. Is there some build in function for this?

Looking around I made this function for `UInt32`

:

```
function packeveryfour(x::UInt32)#each step swaps the second and third quartiles of successively bigger pieces, result is odd bits go right and even bits go left
t = (x ⊻ (x >> 1)) & 0x22222222#mask 00100010001000100010001000100010
x = x ⊻ t ⊻ (t<<1)
t = (x ⊻ (x >> 2)) & 0x0C0C0C0C#mask 00001100000011000000110000001100
x = x ⊻ t ⊻ (t<<2)
t = (x ⊻ (x >> 4)) & 0x00F000F0#mask 00000000111100000000000011110000
x = x ⊻ t ⊻ (t<<4)
t = (x ⊻ (x >> 8)) & 0x0000FF00#mask 00000000000000001111111100000000
x = x ⊻ t ⊻ (t<<8)
t = (x ⊻ (x >> 1)) & 0x22222222#repeat the process
x = x ⊻ t ⊻ (t<<1)
t = (x ⊻ (x >> 2)) & 0x0C0C0C0C
x = x ⊻ t ⊻ (t<<2)
t = (x ⊻ (x >> 4)) & 0x00F000F0
x = x ⊻ t ⊻ (t<<4)
t = (x ⊻ (x >> 8)) & 0x0000FF00
x = x ⊻ t ⊻ (t<<8)
return x
end
```

This is quite fast,

But there is some problems, first I repeat two time the same thing and second there are some zeros in the middle of my answer when the input is smaller than a full `UInt32`

since this function is gonna use the empty bits left part of the `UInt32`

.

In practice I will work on the 52 first bits of a `UInt64`

so I will need to transpose a (13x4) Matrix.

To summarize:

- Is there a build in (or package) function to transpose binary matrix (store in a
`Int`

type)? - Is there a build in function to interleave bits (even a llvm/processor instruction)?
- Have you any others ideas how to make this very efficiently?

Thanks