# What is the best way to regroup the ones to the most right of every nibbles in a UInt?

Deleting both “binary gaps” and “leading zeros” for every nibbles of Integer’s binary representation,(packing the ones, bits packing,deleting/removing zeros in the middle).
example:

``````#Let's take N a {UInt32}
N=0xa1c7bf2e
#the binary visualization of N=1010|0001|1100|0111|1011|1111|0010|1110
#numbers of ones per nibble is ___2|___1|___2|___3|___3|___4|___1|___3
#the target result should be   0011|0001|0011|0111|0111|1111|0001|0111
``````

I came up with that:

``````function packnibble(a)
t=(a & ((a<<1)|0x55555555)) | ((a>>1)&0x55555555)#look 2-bit groups and change the "10" in "01"
r=(t & ((t<<2)|0x33333333)) | ((t>>2)&0x33333333)#look nibble and finish the job except for "0101" witch stay the same
r=(r & ((r<<1)|0xbbbbbbbb)) | ((r>>1)&0x22222222)#resolve the "0101" case making "0011" with it
end
``````

Is there any way to improve this function or to make another one? thanks

@rapasite Do you have reason to believe that your approach is problematic?
Your function is quite fast in practice.
If saving the time Julia takes to call this function matters,
declaring it `@inline` may be appropriate.

``````@inline function packnibbles(x::UInt32)
x = (x & ((x << 1) | 0x55555555)) | ((x >> 1) & 0x55555555)  # 0b10  ↦ 0b01
x = (x & ((x << 2) | 0x33333333)) | ((x >> 2) & 0x33333333)  # 0b0100 ↦ 0b0001
x = (x & ((x << 1) | 0xbbbbbbbb)) | ((x >> 1) & 0x22222222)  # 0b0101 ↦ 0b0011
return x
end``````

Hi,
thank you @JeffreySarnoff
I don’t think it is problematic, I just want to make the best one
after a white night I came up with something better hahaha:

``````# 0b1000 ↦ 0b0100 ↦ 0b0010 ↦ 0b0001
# 0b1100 ↦ 0b1010 ↦ 0b0101 ↦ 0b0011
# (0b0110,0b1001) ↦ 0b0101
# 0b1110 ↦ 0b1101 ↦ 0b1011 ↦ 0b0111
@inline function packnibble5(x::UInt64)
x -= x>>1 & ~x & 0x7777777777777777 # 0b10  ↦ 0b01
x -= x>>1 & ~x & 0x7777777777777777
x -= x>>1 & ~x & 0x7777777777777777
return x
end
``````

If someone can improve this even further It’ll make me happy!

Thanks

1 Like

Using a lookup table is often the way to go for localized bit operations like this. It’s dumb and not very satisfying, but fast.

``````const bytes = @SVector [
0x00, 0x01, 0x01, 0x03, 0x01, 0x03, 0x03, 0x07, 0x01, 0x03, 0x03, 0x07, 0x03, 0x07, 0x07, 0x0f,
0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
0xf0, 0xf1, 0xf1, 0xf3, 0xf1, 0xf3, 0xf3, 0xf7, 0xf1, 0xf3, 0xf3, 0xf7, 0xf3, 0xf7, 0xf7, 0xff,
]

function packnibble6(x::UInt64)
@inbounds (
bytes[1 + x         & 0xff]       |
bytes[1 + (x >>  8) & 0xff] <<  8 |
bytes[1 + (x >> 16) & 0xff] << 16 |
bytes[1 + (x >> 24) & 0xff] << 24 |
bytes[1 + (x >> 32) & 0xff] << 32 |
bytes[1 + (x >> 40) & 0xff] << 40 |
bytes[1 + (x >> 48) & 0xff] << 48 |
bytes[1 + (x >> 56) & 0xff] << 56
)
end

@btime packnibble5(\$(rand(UInt64)))
@btime packnibble6(\$(rand(UInt64)))
``````

Gives me

``````  2.073 ns (0 allocations: 0 bytes)
1.300 ns (0 allocations: 0 bytes)
``````

Edit: Fixed #\$%@ one-based indexing.

Does that code work for you? I get:

``````julia> packnibble6(0xaaaaaaaaaaaaaaaa)
0x33
``````

Due to:

``````julia> typeof(bytes)
SArray{Tuple{256},UInt8,1,256}

julia> typeof(bytes[1 + ((x >>  8) & 0xff)] << 8)
UInt8

julia> bytes[1 + ((x >>  8) & 0xff)] << 8
0x00
``````

If I change `bytes` to be `UInt64`, it works, but is no longer faster than `packnibble5`.

Hm yeah, I messed that up. Sorry.

Yeah look up table is not very satisfying but I guess I should learn how to do it properly.

1. What is the best ways to store the value? is that with a `@SVector`? or a standard matrix? or a hash-table?

2. what is the best way the test the value?

In other words what is the state of the art optimize fastest lookup table in Julia?

thanks you

Just use a regular, constant, array:

``````const LOOKUP_TABLE = UInt64[ ... ]
``````

`@SVector` won’t help. In this case, it actually makes things worse. Definitely do not use a hash table / dictionary as long as your keys are within a reasonable range and you’re storing all of them.

Note that your `packnibble5` performs considerably better than the lookup table suggested above, especially when vectorized with simd instructions. (Also note that in the below benchmarks, I’ve corrected `packnibble6` to use `UInt64`s.)

``````julia> v = rand(UInt64, 100_000);

julia> @btime (s = UInt64(0); for n = \$v s += packnibble6(n) end; s);
336.744 μs (0 allocations: 0 bytes)

julia> @btime (s = UInt64(0); for n = \$v s += packnibble5(n) end; s);
145.011 μs (0 allocations: 0 bytes)
``````

Without `simd`, it’s already more than 2x as fast. Now let’s vectorize it:

``````julia> @btime (s = UInt64(0); @simd for n = \$v s += packnibble6(n) end; s);
279.299 μs (0 allocations: 0 bytes)

julia> @btime (s = UInt64(0); @simd for n = \$v s += packnibble5(n) end; s);
33.696 μs (0 allocations: 0 bytes)
``````

More than 8 times as fast…

1 Like