Fastest way to unpack big-endian Int24?

What’s the fastest way to unpack a vector of big-endian 24-bit signed integers stored in a binary file? Suppose the number of Int24s that I need to read is known.

Reading a byte at a time is probably pretty fast. It’s possible that reading 32 bits at a time would be faster, but that seems unlikely to be worth the additional complexity.

My really clumsy solution is to define V = zeros(UInt8, 4) and loop as

for i = 1:N
    V[4] = read(fid, UInt8)
    V[3] = read(fid, UInt8)
    V[2] = read(fid, UInt8)
    x[i] = reinterpret(Int32, V)[1] >> 8
end

That seems to parse the data correctly but feels like a kludge.

I would skip the V part:

for i = 1:N
    x[i] = ((read(fid, UInt8) % UInt32) << 16) |
           ((read(fid, UInt8) % UInt32) <<  8) | read(fid, UInt8)
end

I was under the impression that reading in an entire UInt8 vector at once and then byte-swapping the in-memory data was faster, because each call to read has to go through filesystem checks.

When you benchmark it, do post the results.


I came across this behavior at the beginning of the year, so I’ll try to dig up some code (I may have just been doing things wrong), but in the meantime, this issue is where I got the idea that working on in-memory data is faster than making repeated calls to read.

Edit: looking back it turned out what I was doing before was mostly slow for unrelated reasons. Now I use Traceur and am better informed. Sorry for the noise.

I think that issue refers to stdin/stdout specifically. Reading one byte at a time from a file should be fine, since it’s all buffered by the underlying C library. It does however incur a call to the C library for each character read, so there will be a little bit of overhead.

OK, this is weird. The specification claims to use Int24s, which I’ve assumed are big-endian and signed since that’s what they do with Int16 and Int32. So Stefan’s solution or a variant with Ints should work … right?

The weird part: so far, only my solution yields the same values as their precompiled binary, which I know to be correct (e.g., I can visually inspect the seismograms).

So … is what I’m doing even consistent with a big-endian Int24? I think so, but I don’t actually know that’s what they did. My solution was obtained via guess-and-check; the format spec is a two-page PDF that can’t be Googled because the data format is officially named “win32”.
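For what it’s worth, one way to sanity-check whether a decoder really implements big-endian two’s-complement Int24 is to feed it byte triples with known values. The helper and test vectors below are just illustrative, not from the win32 spec:

```julia
# Decode one big-endian two's-complement Int24 from three bytes: pack them
# into the high 24 bits of a UInt32, reinterpret as Int32, then do an
# arithmetic right shift by 8 so the sign bit gets extended.
decode_int24(b1::UInt8, b2::UInt8, b3::UInt8) =
    reinterpret(Int32, (UInt32(b1) << 24) | (UInt32(b2) << 16) | (UInt32(b3) << 8)) >> 8

decode_int24(0x00, 0x00, 0x01)  # 1
decode_int24(0xff, 0xff, 0xff)  # -1
decode_int24(0x80, 0x00, 0x00)  # -8388608 (most negative Int24)
decode_int24(0x7f, 0xff, 0xff)  # 8388607 (most positive Int24)
```

If a candidate decoder disagrees with these on the negative values, it isn’t sign-extending.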

I was converting to UInt32 rather than Int32. Is that the difference? That does somewhat complicate things if they are using 24-bit two’s complement format. In that case I think this code might do the trick:

for i = 1:N
    y::UInt32 = 0
    y |= read(fid, UInt8) << 24
    y |= read(fid, UInt8) << 16
    y |= read(fid, UInt8) << 8
    x[i] = (y % Int32) >> 8
end

What this code is doing is reading the data into a UInt32 in the high 24 bits (y is a typed local) and then reinterpreting that as an Int32 and doing a signed shift.


Sorry, I didn’t get that right (serves me right for not testing). This should be better:

for i = 1:N
    y  = UInt32(read(fid, UInt8)) << 24
    y |= UInt32(read(fid, UInt8)) << 16
    y |= UInt32(read(fid, UInt8)) << 8
    x[i] = (y % Int32) >> 8
end

Yes! That worked! I’d need more benchmarking for a definitive answer but I think that yields a slight speed improvement.

For reference, this is the entirety of their Int format description: “…the sampled data are stored in the field #13 as the differences of sequential value (i.e. delta compression). Suitable data size of one sample in #13 is chosen and given in #10 among 0.5, 1, 2, 3, and 4 octets (=bytes) to minimize data size.”
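Since the spec says delta compression, the decoded Int24s are differences, so the actual waveform has to be recovered with a running sum. A minimal sketch — where the initial absolute sample comes from (e.g. a packet header) is an assumption, so check the spec:

```julia
# Reconstruct samples from delta-compressed differences.
# `first_sample` is the initial absolute value (assumed to come from the
# packet header); `deltas` are the decoded Int24 differences.
function undelta(first_sample::Integer, deltas::AbstractVector{<:Integer})
    x = Vector{Int32}(undef, length(deltas) + 1)
    x[1] = first_sample
    for i in eachindex(deltas)
        x[i+1] = x[i] + deltas[i]
    end
    return x
end

undelta(100, [1, -2, 3])  # → 100, 101, 99, 102
```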

OK, so it seems that they’re using a two’s complement signed integer representation. I kind of doubt that the bit twiddling here is going to be the difference; this seems like it would be I/O bound. It’s possible that using mmap might be faster.
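For completeness, an mmap-based variant might look like the sketch below (the filename and layout are assumptions; it treats the file as nothing but packed big-endian Int24s):

```julia
using Mmap

# Memory-map the raw bytes and decode N big-endian two's-complement Int24s.
# No explicit read calls are made; the OS pages the file in as needed.
function read_int24_mmap(filename, N)
    buf = Mmap.mmap(open(filename, "r"), Vector{UInt8}, 3N)
    x = Vector{Int32}(undef, N)
    for i in 1:N
        j = 3(i - 1)
        @inbounds x[i] = reinterpret(Int32, (UInt32(buf[j+1]) << 24) |
                                            (UInt32(buf[j+2]) << 16) |
                                            (UInt32(buf[j+3]) << 8)) >> 8
    end
    return x
end
```

Whether this beats an explicit buffered read depends on the OS page cache, so it would need benchmarking too.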

Makes sense; that’s what they did with Int4. Thanks again for all your help.

I’m not sure if this helps, but if all the Int24s are packed together, buffering on the Julia side is really helpful for this kind of small read:

# Read `N` big endian two's complement `Int24`s, buffered version
function read_int24(io, N)
    buf = read(io, 3*N)
    x = Vector{Int32}(undef, N)
    for i = 1:length(x)
        @inbounds x[i] = reinterpret(Int32, (UInt32(buf[3*(i-1)+1]) << 24) |
                                            (UInt32(buf[3*(i-1)+2]) << 16) |
                                            (UInt32(buf[3*(i-1)+3]) << 8)) >> 8
    end
    return x
end

# Read `N` big endian two's complement `Int24`s, unbuffered version
function read_int24_unbuffered(io, N)
    x = Vector{Int32}(undef, N)
    for i = 1:length(x)
        @inbounds x[i] = reinterpret(Int32, (UInt32(read(io, UInt8)) << 24) |
                                            (UInt32(read(io, UInt8)) << 16) |
                                            (UInt32(read(io, UInt8)) << 8)) >> 8
    end
    return x
end

# Create test data
function write_int24(io, x)
    buf = Vector{UInt8}(undef, 3*length(x))
    for i = 1:length(x)
        y = reinterpret(UInt32, Int32(x[i]) << 8)
        buf[3*(i-1)+1] = (y >> 24) % UInt8
        buf[3*(i-1)+2] = (y >> 16) % UInt8
        buf[3*(i-1)+3] = (y >> 8) % UInt8
    end
    write(io, buf)
end

open("test.dat", "w") do io
    write_int24(io, -1000_000:1000_000)
end

The explicitly buffered version is 10x faster:

julia> io = open("test.dat", "r")
IOStream(<file test.dat>)

julia> @btime read_int24($io, 2_000_001) setup=(seek($io,0)) samples=1 evals=1;
  1.534 ms (4 allocations: 13.35 MiB)

julia> @btime read_int24_unbuffered($io, 2_000_001) setup=(seek($io,0)) samples=1 evals=1;
  15.000 ms (2 allocations: 7.63 MiB)

A lot of instruments have a packet-based format, where the packet header stores the size in bytes of the rest of the data packet. In this case it’s a good idea to read the whole packet into an IOBuffer and proceed from there. This way you recover a more reasonable performance and preserve the convenience of using the IO abstraction:

julia> @btime read_int24_unbuffered(IOBuffer(read($io, 3*2_000_001)), 2_000_001) setup=(seek($io,0)) samples=1 evals=1;
  3.953 ms (5 allocations: 13.35 MiB)
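As a sketch of that packet pattern (the header layout here is made up for illustration; real instrument formats differ, and a real reader would match the header’s documented endianness):

```julia
# Hypothetical packet reader: a native-endian UInt32 payload length followed
# by that many bytes of payload. Slurping the payload into an IOBuffer means
# all subsequent small reads hit memory instead of the C library.
function read_packet(io)
    len = read(io, UInt32)          # hypothetical header field: payload size in bytes
    return IOBuffer(read(io, len))  # buffered in-memory view of the payload
end
```

Then something like read_int24_unbuffered(read_packet(io), N) decodes straight out of memory.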

Edit: I should point out that benchmarks for this kind of thing will depend a lot on whether the file is in the operating system’s page cache. Whether this is true depends on your use case.


Yes, that is indeed quite a big difference!

It’s probably relevant to differentiate between the buffering taking place in the underlying C library, which means that we don’t need to read one byte at a time from the file system, and the buffering that we can do within Julia, which means that we don’t need to call into the C library for each byte.

Your final statement matters here; I think the speedup we see by buffering within Julia mainly applies to cached files (files already present in memory). In your example, a 6 MB file is processed in 1.5 ms, i.e. ~4 GB per second, which is a bit beyond what we can expect if reading from disk.

Surprisingly it looks like some SSDs are nearly this fast. For example the Intel 7600P supposedly has a sequential read speed of ~3 GB/s. So doing it the slow way could be noticeable even for files which aren’t in memory.
