I am trying to write a faster Parquet reader (faster than the one in Parquet.jl atm).
One interesting problem that I encountered is this: How to efficiently unpack a set of bit-packed integers into Vector{T<:Unsigned}
for some T
?
The data is stored in a bit-packed way explained here (see point 1).
See exerpt
So basically given a Vector{UInt8}
of bit-packed values. How do I convert it back Vector{T<:Unsigned}
For example, if
bit_packed = UInt8[161]
and say the bitwidth is 1, i.e every bit represents a value.
The binary form of 161 is 10100001
, so it should unpack to
UInt8[1, 0, 0, 0, 0, 1, 0, 1]
For this case, I think digits
works very well, e.g.
reverse(digits(UInt8(161), base = 2, pad = 8))
For bitwidth greater than 1, I can do bitshift and then do a similar trick.
E.g. consider the pictured example where bitwidth = 3
and I have bit-packed
unpacked = UInt8[0, 1, 2, 3, 4, ,5, 6, 7]
which gets stored packed as
UInt8[136, 198, 250]
nth_values(bitpacked, n, bitwidth) = begin
result = UInt(0)
from_bits = (n - 1)*bitwidth + 1
to_bits = n*bitwidth
from_byte = ceil(Int, from_bits / 8)
to_byte = div(to_bits - 1, 8)
to_byte = to_byte + 1
for byte in @view bitpacked[to_byte:-1:from_byte]
result = result << 8
result = result | byte
end
result = result >> (mod(from_bits-1, 8))
result & UInt(2^bitwidth - 1)
end
nth_values.(Ref(bitpacked), 1:8, 3)
which yields
UInt8[0, 1, 2, 3, 4, 5, 6, ,7]
but I feel it’s a bit unsatisfactory and may be quite slow. I wonder if there are well known algorithm for faster unpacking?