Packing and unpacking an array of N-bit ints

Given an IO, I want to read N bits into each index of an Array{UInt64}, and vice versa. For example, the data [0b0_101_100_011_010_001_000_111_110_101_100_011_010_001_000_111_110_101_100_011_010_001, 0b00_010_001_000_111_110_101_100_011_010_001_000_111_110_101_100_011_010_001_000_111_11] becomes UInt64[1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2]. Here are my current functions:

function unpack(compressed::IO, bits::Int64, size::Int64)
  data = zeros(UInt64, size)
  shift = 0
  mask = 2^bits - 1
  current = read(compressed, UInt64)
  for i in 1:length(data)
    # take the low bits of this value from the current word
    data[i] = (current >>> shift) & mask
    if shift + bits > 64
      # the value straddles a word boundary: fetch the high bits from
      # the next word (in Julia, a negative shift count reverses direction)
      current = read(compressed, UInt64)
      data[i] += (current >>> (shift - 64)) & mask
      shift = shift + bits - 64
    elseif shift + bits == 64
      current = read(compressed, UInt64)
      shift = 0
    else
      shift += bits
    end
  end
  return data
end

function pack(io::IO, uncompressed::Array{UInt64}, bits::Int64)
  # emit the low `bits` bits of each value, LSB-first, by writing out the
  # underlying UInt64 chunks of a BitArray
  write(io, BitArray(n >>> i & 1 == 1 for n in uncompressed for i in 0:bits - 1).chunks)
end

B = rand(UInt64, 256^3); path, io = mktemp(cleanup=true);
@btime pack($io, $B, 3) # 135.244 ms (28 allocations: 17.76 MiB)

A1 = BitArray(n >>> i & 1 == 1 for n in B for i in 0:2).chunks; L = length(A1); A = IOBuffer(reinterpret(UInt8, A1));
@btime (seek($A, 0); unpack($A, 3, L ÷ 3)) # 1.278 ms (4 allocations: 2.00 MiB)

Can anyone think of a faster way of doing this? I was thinking there may be some way to use reinterpret but couldn’t get anything to work. Any help would be appreciated!

Is N (I guess the same as your bits parameter) an arbitrary number? In what range? What is the application here? The problem is easier to optimize for fixed N, of course.

Note that reading data in tiny chunks like this is not very efficient on an arbitrary IO stream. You might want to use BufferedStreams.jl, or simply read a big chunk of bytes at once.
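
For instance (just a sketch, reusing your compressed handle), pull a block into memory up front and index into that:

buf = read(compressed, 128 * 2^10)  # up to 128 KiB at once, as a Vector{UInt8}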

It becomes a lot easier if N divides 8, of course.

Is the motivation for this particular packed format speed or space, or is it for legacy interop? If it’s either of the first two, I’d try a bog-standard, general-purpose streaming compression like TranscodingStreams.jl with CodecZlib.jl. Sure, encoding the constraints of your domain into the compression itself means you can theoretically do better on both fronts, but often an 80% solution is good enough!
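
For instance, a minimal sketch (the file name and variables here are made up; data would be your Vector{UInt64}):

using CodecZlib  # gzip codecs built on TranscodingStreams

# compress: write the raw bytes through a gzip stream
open("voxels.gz", "w") do f
    gz = GzipCompressorStream(f)
    write(gz, reinterpret(UInt8, data))
    close(gz)  # flushes the codec and writes the gzip trailer
end

# decompress: read everything back and reinterpret
bytes = open(io -> read(GzipDecompressorStream(io)), "voxels.gz")
data2 = collect(reinterpret(UInt64, bytes))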

Thanks for the reply!

Is N (I guess the same as your bits parameter) an arbitrary number? In what range? What is the application here? The problem is easier to optimize for fixed N, of course.

N is an arbitrary number in 2:9. The specific purpose of this is to decode the litematic file format, which stores voxel information in this packed format, with N chosen as the smallest number of bits large enough to represent all the needed voxel types. It’s feasible for me to have a specific method for each N if it would achieve higher performance; after reading your answer I’m realizing I should definitely have a special method for N=8.
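
For N=8 it looks like a plain byte copy would already do it; a sketch (unpack8 is just a made-up name):

# special case N=8: each value occupies exactly one byte
unpack8(io::IO, size::Int) = UInt64.(read(io, size))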

Note that reading data in tiny chunks like this is not very efficient on an arbitrary IO stream. You might want to use BufferedStreams.jl, or simply read a big chunk of bytes at once.

In that case, would it be good to read the whole file first and then operate on that? If not, how many bytes at a time might be best?

It becomes a lot easier if N divides 8, of course.

It seems trivial for N=8, but I wouldn’t know how to do 4 and 2 optimally. Do you have any ideas there?

Thanks for the reply! Unfortunately I must decode this specific data format because it’s how the file format is written, as elaborated in my other reply. It actually has both kinds of compression: the information is first packed into N-bit groups, and the output file is then compressed with GZip.

For N=4, for example, you would read in an array b of bytes (half the length of the output), and then your array is just

j = 1
for i = 1:2:length(out)     # two 4-bit values per byte
    out[i]   = b[j] >> 4    # high nibble first
    out[i+1] = b[j] & 0x0f  # then the low nibble
    j += 1
end

for example.

I wouldn’t read in the whole file if it could be arbitrarily large. BufferedStreams has a default buffer size of 128 * 2^10 bytes (128 KiB), which should be reasonable.

Alternatively, you could mmap the file (with the Mmap standard library).
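
Either option is a one-liner to get a byte source (a sketch, with path as in your benchmark):

using BufferedStreams, Mmap

stream = BufferedInputStream(open(path), 128 * 2^10)  # buffered reads, explicit 128 KiB buffer
bytes  = Mmap.mmap(path, Vector{UInt8})               # or: lazily map the whole file as bytes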

For each N, I suspect that the optimal approach is to loop over chunks of lcm(N,8) ÷ 8 == N ÷ gcd(N,8) bytes at a time, and then have a hard-coded unpacking routine for that chunk, specialized for each N.

The above code is the example for N=4, where the chunk size is 1 byte. For N=3, you would read in chunks of 3 bytes and do something like:

i = 0
for j = 1:3:length(bytes)
    out[i += 1] = bytes[j] >> 5
    out[i += 1] = (bytes[j] >> 2) & 0x07
    out[i += 1] = ((bytes[j] & 0x03) << 1) + (bytes[j+1] >> 7)
    out[i += 1] = (bytes[j+1] >> 4) & 0x07
    out[i += 1] = (bytes[j+1] >> 1) & 0x07
    out[i += 1] = ((bytes[j+1] & 0x01) << 2) + (bytes[j+2] >> 6)
    out[i += 1] = (bytes[j+2] >> 3) & 0x07
    out[i += 1] = bytes[j+2] & 0x07
end
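
Wrapped up into a self-contained routine (same shifts as above), that might look like this sketch; note it assumes MSB-first bit order within each 3-byte chunk, so double-check against your on-disk layout:

# unpack 3-bit values, 8 of them per 3-byte chunk (MSB-first)
function unpack3(bytes::AbstractVector{UInt8})
    @assert length(bytes) % 3 == 0
    out = Vector{UInt64}(undef, (length(bytes) ÷ 3) * 8)
    i = 0
    for j = 1:3:length(bytes)
        out[i += 1] = bytes[j] >> 5
        out[i += 1] = (bytes[j] >> 2) & 0x07
        out[i += 1] = ((bytes[j] & 0x03) << 1) + (bytes[j+1] >> 7)
        out[i += 1] = (bytes[j+1] >> 4) & 0x07
        out[i += 1] = (bytes[j+1] >> 1) & 0x07
        out[i += 1] = ((bytes[j+1] & 0x01) << 2) + (bytes[j+2] >> 6)
        out[i += 1] = (bytes[j+2] >> 3) & 0x07
        out[i += 1] = bytes[j+2] & 0x07
    end
    return out
end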

However, before you go to a lot of trouble to optimize reading a legacy file format by specializing 9 different cases, I would first check that the time actually matters (especially once you switch to a buffered stream and/or read in large blocks of bytes).
