Efficient way to go from bit-packed value to `Vector{T<:Unsigned}`

xiaodai · May 25, 2020, 1:45am

I am trying to write a faster Parquet reader (faster than the one in Parquet.jl atm).

One interesting problem that I encountered is this: How to efficiently unpack a set of bit-packed integers into Vector{T<:Unsigned} for some T?

The data is stored in a bit-packed way explained here (see point 1).

See exerpt

So basically given a Vector{UInt8} of bit-packed values. How do I convert it back Vector{T<:Unsigned}

For example, if

bit_packed = UInt8[161]

and say the bitwidth is 1, i.e every bit represents a value.

The binary form of 161 is 10100001, so it should unpack to

UInt8[1, 0, 0, 0, 0, 1, 0, 1]

For this case, I think digits works very well, e.g.

reverse(digits(UInt8(161), base = 2, pad = 8))

For bitwidth greater than 1, I can do bitshift and then do a similar trick.

E.g. consider the pictured example where bitwidth = 3 and I have bit-packed

unpacked = UInt8[0, 1, 2, 3, 4, ,5, 6, 7]

which gets stored packed as

UInt8[136, 198, 250]

nth_values(bitpacked, n, bitwidth) = begin
    result = UInt(0)
    from_bits = (n - 1)*bitwidth + 1 
    to_bits = n*bitwidth

    from_byte = ceil(Int, from_bits /  8)
    to_byte = div(to_bits - 1, 8)
    to_byte = to_byte + 1

   for byte in @view bitpacked[to_byte:-1:from_byte]
      result = result << 8
      result = result | byte
   end
   
   result = result >> (mod(from_bits-1, 8))
   result & UInt(2^bitwidth - 1)
end

nth_values.(Ref(bitpacked), 1:8, 3)

which yields

UInt8[0, 1, 2, 3, 4, 5, 6, ,7]

but I feel it’s a bit unsatisfactory and may be quite slow. I wonder if there are well known algorithm for faster unpacking?

Oscar_Smith · May 25, 2020, 1:48am

Do you actually need to unpack? It seems like one approach would be to define a type of 3 bit integer arrays.

xiaodai · May 25, 2020, 1:49am

is that even possible? How then do I sort it etc?

rdeits · May 25, 2020, 1:59am

What about something like this approach: Reinterpret packed data which is already in memory? - #2 by stevengj ?

xiaodai · May 25, 2020, 2:02am

Feels like it only works if the data is a mulitple of 8 bits. The bitpacked values can be any number and not necessarily a multiple of 8.

xiaodai · May 25, 2020, 2:04am

Another problem is that the bitwidth is not a factor of 8 and may not be a multiple of 8 so one value can be stored across two bytes.

stevengj · May 25, 2020, 2:52am

Just break the data into chunks of the least-common multiple lcm(8, bitwidth) ÷ 8 bytes, and then use bit-shift/mask operations.

For example, in your 3-bit example, I would loop through the data in groups of 3 bytes, and pull out 8 numbers from each 3-byte block.

xiaodai · May 25, 2020, 3:26am

Thanks for the suggestion. I was thinking something along the same line, but the bitwidth can be as large as 32. So say my bitwidth is 17 so I can store it in a primitive type like this

primitive type Foo lcm(17, 8) end

and lcm(8, 17) happens to be 136. But there is no bitshift operation defined on it and there is also no &

E.g.

bitpicked = UInt8[1:8...]
bitpacked_val = unsafe_load(Ptr{UInt}(pointer(bitpicked)))
bitpacked_val & UInt(2^17-1)

works but

bitpacked_val = unsafe_load(Ptr{Foo}(pointer(bitpicked)))
bitpacked_val & UInt(2^17-1)

doesn’t work.

Is there a way define shifts and mask for these primitive types?

xiaodai · May 25, 2020, 3:29am

That would require defining a bitshift operation on 24bits or I need to Base.zext_int to a 32 bit integer before I can do the operations. But seems like I am out of luck for bitwidth where lcm(bitwidth, 8) > 128.

One thing I can do is to treat those as special and anything smaller, I can just Base.zext_int into the next size for pulling out numbers.

datnamer · May 25, 2020, 4:06am

I hardly know anything about this low level programming, but would this package be helpful?GitHub - RelationalAI-oss/Blobs.jl: Binary blobs with on-the-fly pointer patching

xiaodai · May 25, 2020, 4:44am

Thanks. I thinkit falls into the same category as previous suggestions in that it doesn’t to be able to deal with high bitwidth structures.

Topic		Replies	Views
Simple bit manipulation question New to Julia	6	1336	March 1, 2018
Packing and unpacking an array of N-bit ints Performance	5	143	January 27, 2025
Convert integer to bits array New to Julia	12	16767	October 14, 2022
I have: Vector{UInt8}. I need: BitVector General Usage	6	3564	May 25, 2022
Convert UInt array to bitarray General Usage convert	4	1838	August 1, 2021

Efficient way to go from bit-packed value to `Vector{T<:Unsigned}`

Related topics