Reading binary file into a Vector of custom struct

I am trying to find the best way to read a binary file directly into a vector of structs.

The binary file is laid out so that it can be read as a vector of objects with the following format, where the file may contain any number of such objects appended one after another:

using StaticArrays

struct BinObj
    head1::UInt32
    head2::Float32
    ⋮
    data::SMatrix{40,200,Float32,8000}
    foot::UInt32
end

And I have been reading in as

reinterpret(BinObj, read("path/to/the/binary.file"))

However, since the data portion is such a large matrix, doing anything with the resulting SMatrix values is very slow. I want to read the data directly into a vanilla Julia array, but then how do I tell reinterpret how many bytes belong to each of these arrays?

Secondly, the reinterpret call produces a ReinterpretArray wrapper rather than a plain vector, e.g.:

20-element reinterpret(BinObj, ::Vector{UInt8}):

How can I read directly into a Vector{BinObj} without the intermediate reinterpret object? Is this possible without a copy operation, which would allocate too much memory?

How about this read method?

io = open("path/to/the/binary.file")
read(io, BinObj)

This yields

ERROR: The IO stream does not support reading objects of type BinObj

I presume that is because the binary file has multiple appended BinObjs and not just one?

Oh, oops, I think those methods might only be for built-in types. I don’t know why reinterpret slows this down; is the problem that accessing data through the reinterpreted array is much slower than accessing data in a directly constructed BinObj?

That’s not really the main issue, and I can live with the reinterpret overhead if need be.

The big problem is that StaticArrays become very slow once they have more than ~100 elements, but I cannot think of any other way to make my struct a bits type that reinterpret can read into (even though I know a priori how large my array will be). I really want the data field to be a vanilla Julia array.

That makes sense. I think your best option is probably to write a function that performs several reads to construct a single BinObj: it can read the data field into an ordinary array and build the final object from all of those reads. Keep calling that function until EOF, appending to your final vector of BinObj.

Another option might be to read the whole file into one array and then construct each BinObj from reinterpreted slices, but I’m not sure that won’t allocate more.
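Purely as an illustration of what I mean (it assumes the record has only the four fields shown above, with data stored as a plain Matrix{Float32}, so one record is 4 + 4 + 40*200*4 + 4 = 32012 bytes; OBJSIZE and binobjs_from_bytes are just placeholder names):

# Illustrative sketch only: assumes a record is exactly the four fields shown,
# with the 40*200 Float32 payload stored as a plain Matrix{Float32}.
const OBJSIZE = 4 + 4 + 40*200*4 + 4   # 32012 bytes per record

function binobjs_from_bytes(bytes::Vector{UInt8})
    n = length(bytes) ÷ OBJSIZE
    objs = Vector{BinObj}(undef, n)
    for i in 1:n
        off   = (i - 1) * OBJSIZE
        head1 = reinterpret(UInt32,  bytes[off+1:off+4])[1]
        head2 = reinterpret(Float32, bytes[off+5:off+8])[1]
        # each slice copies its bytes, so this does allocate per record
        data  = reshape(collect(reinterpret(Float32, bytes[off+9:off+32008])), 40, 200)
        foot  = reinterpret(UInt32,  bytes[off+32009:off+32012])[1]
        objs[i] = BinObj(head1, head2, data, foot)
    end
    return objs
end

objs = binobjs_from_bytes(read("path/to/the/binary.file"))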

I have had a similar problem when trying to read a vector of Float32 values. That is obviously simpler than your compound struct but maybe the same technique would work?

I ended up using the read! function like so

n = div(filesize(filename),sizeof(BinObj))
read!(filename,Vector{BinObj}(undef,n))

This uses the file size to work out how many BinObj objects are in the file, so it can allocate a Vector of the correct length. The read! call then fills that Vector directly from the binary data.

I’m not sure if this will actually work with a compound struct, but it’s worth a try.
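For what it’s worth, in the simpler Float32 case I mentioned, the same pattern with Float32 instead of BinObj looks like this:

# The simpler case: a file that is nothing but raw Float32 values
n = div(filesize(filename), sizeof(Float32))
v = read!(filename, Vector{Float32}(undef, n))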


@Jordan_Cluts Thanks, this solves my secondary question about skipping the reinterpret and not allocating more than I need.

@contradict Thanks, I will give that approach a try after my morning meetings.

@contradict and @Jordan_Cluts Thank you for your help!

This is the current way I have approached this problem. I’m sure it can be cleaned up some more, but for the time being this seems to work:

struct BinObj
    head1::UInt32
    head2::Float32
    ⋮
    data::Matrix{Float32} # Has size (40, 200)
    foot::UInt32
end # Has total size BINOBJSIZE

function _readIntoBinObj(iostream)
    head1 = read(iostream, UInt32)
    head2 = read(iostream, Float32)
    ⋮
    data = read!(iostream, Matrix{Float32}(undef, 40, 200))
    foot = read(iostream, UInt32)
    return BinObj(head1, head2, ..., data, foot)
end

function getBinObjVector(filename)
    n = filesize(filename) ÷ BINOBJSIZE
    # open with a do-block so the file handle is closed when we are done
    open(filename) do io
        [_readIntoBinObj(io) for i in 1:n]
    end
end