What is the best way to regroup the ones to the most right of every nibbles in a UInt?

rapasite · November 28, 2018, 8:43pm

Deleting both “binary gaps” and “leading zeros” for every nibbles of Integer’s binary representation,(packing the ones, bits packing,deleting/removing zeros in the middle).
example:

#Let's take N a {UInt32}
N=0xa1c7bf2e
#the binary visualization of N=1010|0001|1100|0111|1011|1111|0010|1110
#numbers of ones per nibble is ___2|___1|___2|___3|___3|___4|___1|___3
#the target result should be   0011|0001|0011|0111|0111|1111|0001|0111

I came up with that:

function packnibble(a)
    t=(a & ((a<<1)|0x55555555)) | ((a>>1)&0x55555555)#look 2-bit groups and change the "10" in "01"
    r=(t & ((t<<2)|0x33333333)) | ((t>>2)&0x33333333)#look nibble and finish the job except for "0101" witch stay the same 
    r=(r & ((r<<1)|0xbbbbbbbb)) | ((r>>1)&0x22222222)#resolve the "0101" case making "0011" with it
end

Is there any way to improve this function or to make another one? thanks

JeffreySarnoff · November 29, 2018, 1:47am

@rapasite Do you have reason to believe that your approach is problematic?
Your function is quite fast in practice.
If saving the time Julia takes to call this function matters,
declaring it @inline may be appropriate.

@inline function packnibbles(x::UInt32)
    x = (x & ((x << 1) | 0x55555555)) | ((x >> 1) & 0x55555555)  # 0b10  ↦ 0b01
    x = (x & ((x << 2) | 0x33333333)) | ((x >> 2) & 0x33333333)  # 0b0100 ↦ 0b0001
    x = (x & ((x << 1) | 0xbbbbbbbb)) | ((x >> 1) & 0x22222222)  # 0b0101 ↦ 0b0011
    return x
end

rapasite · December 29, 2018, 10:07pm

Hi,
thank you @JeffreySarnoff
I don’t think it is problematic, I just want to make the best one
after a white night I came up with something better hahaha:

# 0b1000 ↦ 0b0100 ↦ 0b0010 ↦ 0b0001
# 0b1100 ↦ 0b1010 ↦ 0b0101 ↦ 0b0011
# (0b0110,0b1001) ↦ 0b0101
# 0b1110 ↦ 0b1101 ↦ 0b1011 ↦ 0b0111
@inline function packnibble5(x::UInt64)
    x -= x>>1 & ~x & 0x7777777777777777 # 0b10  ↦ 0b01
    x -= x>>1 & ~x & 0x7777777777777777
    x -= x>>1 & ~x & 0x7777777777777777
    return x
end

If someone can improve this even further It’ll make me happy!

Thanks

TheBB · January 1, 2019, 8:00pm

Using a lookup table is often the way to go for localized bit operations like this. It’s dumb and not very satisfying, but fast.

const bytes = @SVector [
    0x00, 0x01, 0x01, 0x03, 0x01, 0x03, 0x03, 0x07, 0x01, 0x03, 0x03, 0x07, 0x03, 0x07, 0x07, 0x0f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0xf0, 0xf1, 0xf1, 0xf3, 0xf1, 0xf3, 0xf3, 0xf7, 0xf1, 0xf3, 0xf3, 0xf7, 0xf3, 0xf7, 0xf7, 0xff,
]

function packnibble6(x::UInt64)
    @inbounds (
        bytes[1 + x         & 0xff]       |
        bytes[1 + (x >>  8) & 0xff] <<  8 |
        bytes[1 + (x >> 16) & 0xff] << 16 |
        bytes[1 + (x >> 24) & 0xff] << 24 |
        bytes[1 + (x >> 32) & 0xff] << 32 |
        bytes[1 + (x >> 40) & 0xff] << 40 |
        bytes[1 + (x >> 48) & 0xff] << 48 |
        bytes[1 + (x >> 56) & 0xff] << 56
    )
end

@btime packnibble5($(rand(UInt64)))
@btime packnibble6($(rand(UInt64)))

Gives me

  2.073 ns (0 allocations: 0 bytes)
  1.300 ns (0 allocations: 0 bytes)

Edit: Fixed #$%@ one-based indexing.

bennedich · January 1, 2019, 8:58pm

Does that code work for you? I get:

julia> packnibble6(0xaaaaaaaaaaaaaaaa)
0x33

Due to:

julia> typeof(bytes)
SArray{Tuple{256},UInt8,1,256}

julia> typeof(bytes[1 + ((x >>  8) & 0xff)] << 8)
UInt8

julia> bytes[1 + ((x >>  8) & 0xff)] << 8
0x00

If I change bytes to be UInt64, it works, but is no longer faster than packnibble5.

TheBB · January 1, 2019, 9:20pm

Hm yeah, I messed that up. Sorry.

rapasite · January 6, 2019, 3:54pm

Yeah look up table is not very satisfying but I guess I should learn how to do it properly.

What is the best ways to store the value? is that with a @SVector? or a standard matrix? or a hash-table?
what is the best way the test the value?

In other words what is the state of the art optimize fastest lookup table in Julia?

thanks you

bennedich · January 6, 2019, 11:57pm

Just use a regular, constant, array:

const LOOKUP_TABLE = UInt64[ ... ]

@SVector won’t help. In this case, it actually makes things worse. Definitely do not use a hash table / dictionary as long as your keys are within a reasonable range and you’re storing all of them.

Note that your packnibble5 performs considerably better than the lookup table suggested above, especially when vectorized with simd instructions. (Also note that in the below benchmarks, I’ve corrected packnibble6 to use UInt64s.)

julia> v = rand(UInt64, 100_000);

julia> @btime (s = UInt64(0); for n = $v s += packnibble6(n) end; s);
  336.744 μs (0 allocations: 0 bytes)

julia> @btime (s = UInt64(0); for n = $v s += packnibble5(n) end; s);
  145.011 μs (0 allocations: 0 bytes)

Without simd, it’s already more than 2x as fast. Now let’s vectorize it:

julia> @btime (s = UInt64(0); @simd for n = $v s += packnibble6(n) end; s);
  279.299 μs (0 allocations: 0 bytes)

julia> @btime (s = UInt64(0); @simd for n = $v s += packnibble5(n) end; s);
  33.696 μs (0 allocations: 0 bytes)

More than 8 times as fast…

Topic		Replies	Views
What is the most efficient binary representation of unsigned integers in Julia? New to Julia question , bitstring	17	3209	March 22, 2021
Interleaving bits (Z-order curve,morton code).Transpose binary matrix Performance	8	2696	December 16, 2018
Compiling to branch table General Usage question	18	2726	December 29, 2018
Simple bit manipulation question New to Julia	6	1337	March 1, 2018
Shifting the bit in the array General Usage	29	1020	June 22, 2023

What is the best way to regroup the ones to the most right of every nibbles in a UInt?

Related topics