What is the best way to pack the one-bits to the right within every nibble of a UInt?

The goal is to remove both the “binary gaps” and the “leading zeros” within every nibble of an integer’s binary representation — in other words, packing the one-bits toward the least significant end of each nibble.
Example:

# Let N be a UInt32
N = 0xa1c7bf2e
# binary representation of N:  1010|0001|1100|0111|1011|1111|0010|1110
# number of ones per nibble:   ___2|___1|___2|___3|___3|___4|___1|___3
# target result:               0011|0001|0011|0111|0111|1111|0001|0111
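Put differently, each packed nibble should equal `(1 << count_ones(nibble)) - 1`. A naive per-nibble reference (just a sketch for checking results; `packnibble_ref` is a made-up name) could look like:

```julia
# Naive reference: rebuild each nibble as (1 << popcount) - 1,
# i.e. popcount ones flushed to the least significant end.
function packnibble_ref(x::UInt32)
    r = zero(UInt32)
    for i in 0:4:28
        n = (x >> i) & 0xf
        r |= ((UInt32(1) << count_ones(n)) - UInt32(1)) << i
    end
    return r
end

packnibble_ref(0xa1c7bf2e) == 0x31377f17  # the target shown above
```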

I came up with this:

function packnibble(a)
    t = (a & ((a << 1) | 0x55555555)) | ((a >> 1) & 0x55555555)  # look at each 2-bit group and turn "10" into "01"
    r = (t & ((t << 2) | 0x33333333)) | ((t >> 2) & 0x33333333)  # look at whole nibbles and finish the job, except "0101" which stays the same
    r = (r & ((r << 1) | 0xbbbbbbbb)) | ((r >> 1) & 0x22222222)  # resolve the "0101" case, turning it into "0011"
    return r
end

Is there any way to improve this function, or a better approach altogether? Thanks!

@rapasite Do you have reason to believe that your approach is problematic? Your function is quite fast in practice. If saving the time Julia takes to call this function matters, declaring it @inline may be appropriate.

@inline function packnibbles(x::UInt32)
    x = (x & ((x << 1) | 0x55555555)) | ((x >> 1) & 0x55555555)  # 0b10  ↦ 0b01
    x = (x & ((x << 2) | 0x33333333)) | ((x >> 2) & 0x33333333)  # 0b0100 ↦ 0b0001
    x = (x & ((x << 1) | 0xbbbbbbbb)) | ((x >> 1) & 0x22222222)  # 0b0101 ↦ 0b0011
    return x
end

Hi,
thank you @JeffreySarnoff.
I don’t think it is problematic, I just want to make it as good as possible.
After a sleepless night I came up with something better, hahaha:

# 0b1000 ↦ 0b0100 ↦ 0b0010 ↦ 0b0001
# 0b1100 ↦ 0b1010 ↦ 0b0101 ↦ 0b0011
# (0b0110, 0b1001) ↦ 0b0101 ↦ 0b0011
# 0b1110 ↦ 0b1101 ↦ 0b1011 ↦ 0b0111
@inline function packnibble5(x::UInt64)
    x -= (x >> 1) & ~x & 0x7777777777777777  # 0b10 ↦ 0b01: each pass moves every stray 1 one bit right within its nibble
    x -= (x >> 1) & ~x & 0x7777777777777777
    x -= (x >> 1) & ~x & 0x7777777777777777  # three passes suffice for the worst case, 0b1000
    return x
end
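Since the transform acts on each nibble independently, an exhaustive sanity check over the 16 possible nibble values is cheap (a sketch, using packnibble5 as defined above):

```julia
# Every packed nibble must be (1 << popcount) - 1:
# popcount ones flushed to the least significant end.
all(packnibble5(UInt64(n)) == (UInt64(1) << count_ones(n)) - 1 for n in 0:15)  # true
```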

If someone can improve this even further, it’ll make me happy!

Thanks


Using a lookup table is often the way to go for localized bit operations like this. It’s dumb and not very satisfying, but fast.

using StaticArrays  # for @SVector

const bytes = @SVector [
    0x00, 0x01, 0x01, 0x03, 0x01, 0x03, 0x03, 0x07, 0x01, 0x03, 0x03, 0x07, 0x03, 0x07, 0x07, 0x0f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0x10, 0x11, 0x11, 0x13, 0x11, 0x13, 0x13, 0x17, 0x11, 0x13, 0x13, 0x17, 0x13, 0x17, 0x17, 0x1f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0x30, 0x31, 0x31, 0x33, 0x31, 0x33, 0x33, 0x37, 0x31, 0x33, 0x33, 0x37, 0x33, 0x37, 0x37, 0x3f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0x70, 0x71, 0x71, 0x73, 0x71, 0x73, 0x73, 0x77, 0x71, 0x73, 0x73, 0x77, 0x73, 0x77, 0x77, 0x7f,
    0xf0, 0xf1, 0xf1, 0xf3, 0xf1, 0xf3, 0xf3, 0xf7, 0xf1, 0xf3, 0xf3, 0xf7, 0xf3, 0xf7, 0xf7, 0xff,
]

function packnibble6(x::UInt64)
    @inbounds (
        bytes[1 + x         & 0xff]       |
        bytes[1 + (x >>  8) & 0xff] <<  8 |
        bytes[1 + (x >> 16) & 0xff] << 16 |
        bytes[1 + (x >> 24) & 0xff] << 24 |
        bytes[1 + (x >> 32) & 0xff] << 32 |
        bytes[1 + (x >> 40) & 0xff] << 40 |
        bytes[1 + (x >> 48) & 0xff] << 48 |
        bytes[1 + (x >> 56) & 0xff] << 56
    )
end

using BenchmarkTools

@btime packnibble5($(rand(UInt64)))
@btime packnibble6($(rand(UInt64)))

Gives me

  2.073 ns (0 allocations: 0 bytes)
  1.300 ns (0 allocations: 0 bytes)

Edit: Fixed #$%@ one-based indexing.

Does that code work for you? I get:

julia> packnibble6(0xaaaaaaaaaaaaaaaa)
0x33

Due to:

julia> typeof(bytes)
SArray{Tuple{256},UInt8,1,256}

julia> typeof(bytes[1 + ((x >>  8) & 0xff)] << 8)
UInt8

julia> bytes[1 + ((x >>  8) & 0xff)] << 8
0x00

If I change bytes to be UInt64, it works, but is no longer faster than packnibble5.
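Alternatively, the table could stay UInt8 (a quarter of the cache footprint) and each byte be widened at the use site; a sketch, reusing the bytes table from above, with a made-up name for the fixed function:

```julia
# Sketch: keep the UInt8 table, but widen each looked-up byte to
# UInt64 before shifting it into place, so the shift can't truncate.
function packnibble6_widened(x::UInt64)
    r = zero(UInt64)
    for i in 0:8:56
        @inbounds r |= UInt64(bytes[1 + ((x >> i) & 0xff)]) << i
    end
    return r
end
```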

Hm yeah, I messed that up. Sorry.

Yeah, a lookup table is not very satisfying, but I guess I should learn how to do it properly.

  1. What is the best way to store the values? With an @SVector? A standard array? A hash table?

  2. What is the best way to look up a value?

In other words, what is the state-of-the-art, fastest lookup table in Julia?

Thank you

Just use a regular constant array:

const LOOKUP_TABLE = UInt64[ ... ]

@SVector won’t help. In this case, it actually makes things worse. Definitely do not use a hash table / dictionary as long as your keys are within a reasonable range and you’re storing all of them.
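For instance, the 256-entry table can be built programmatically instead of written out by hand (a sketch; `packbyte` and `PACKED` are made-up names):

```julia
# Entry i packs both nibbles of the byte i, stored widened to UInt64
# so that later shifts into a UInt64 result don't truncate.
packbyte(i) = UInt64(((1 << count_ones(i & 0xf)) - 1) |
                     (((1 << count_ones(i >> 4)) - 1) << 4))
const PACKED = [packbyte(i) for i in 0:255]
```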

Note that your packnibble5 performs considerably better than the lookup table suggested above, especially when vectorized with SIMD instructions. (Also note that in the benchmarks below, I’ve corrected packnibble6 to use UInt64s.)

julia> v = rand(UInt64, 100_000);

julia> @btime (s = UInt64(0); for n = $v s += packnibble6(n) end; s);
  336.744 μs (0 allocations: 0 bytes)

julia> @btime (s = UInt64(0); for n = $v s += packnibble5(n) end; s);
  145.011 μs (0 allocations: 0 bytes)

Without SIMD, it’s already more than twice as fast. Now let’s vectorize it:

julia> @btime (s = UInt64(0); @simd for n = $v s += packnibble6(n) end; s);
  279.299 μs (0 allocations: 0 bytes)

julia> @btime (s = UInt64(0); @simd for n = $v s += packnibble5(n) end; s);
  33.696 μs (0 allocations: 0 bytes)

More than 8 times as fast…
