I am trying to find the best way to read a binary file directly into a vector of structs.
The binary file is laid out so that it can be read as a vector of objects like the following (the file may contain any number of such objects appended one after another):
using StaticArrays
struct BinObj
    head1::UInt32
    head2::Float32
    ⋮
    data::SMatrix{40,200,Float32,8000}
    foot::UInt32
end
And I have been reading in as
reinterpret(BinObj, read("path/to/the/binary.file"))
However, since the data portion is such a large matrix, doing anything with the resulting SMatrix values is very slow. I want to read the data directly into a vanilla Julia array, but how do I let Julia know how much memory to allocate for these arrays?
Secondly, the reinterpret call produces a ReinterpretArray wrapper rather than a plain Vector, e.g.:
20-element reinterpret(BinObj, ::Vector{UInt8}):
How can I read directly into a Vector{BinObj} without the intermediate reinterpret object? Is this possible without doing a copy operation, which would allocate too much memory?
How about this read method?
io = open("path/to/the/binary.file")
read(io, BinObj)
This yields
ERROR: The IO stream does not support reading objects of type BinObj
I presume this is because the binary file has multiple appended BinObjs and not just one?
Oh, oops, I think those methods might only be for built-in types. I don’t know why reinterpret slows this down; is the problem that accessing data inside a reinterpreted array is much slower than accessing data from a directly constructed BinObj?
That’s not really the huge issue, and I can live with the reinterpret overhead if need be.
The big problem is that static arrays are very slow once they have more than ~100 elements, but I cannot think of any other way to make my struct a bits type that reinterpret can read into (even though I know a priori how large my array should be). I really want the data element to be a vanilla Julia array.
That makes sense. I think your best option is probably to write a function that performs multiple reads to construct a BinObj; it can read data into an ordinary array and build the final object from all the reads. Keep calling that function until EOF, appending to your final array of BinObjs.
Another option might be to read the whole file into one array and then construct each BinObj from reinterpreted slices, but I’m not sure that won’t allocate more.
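For what it’s worth, that reinterpreted-slices idea might look something like the sketch below. It is untested and simplified: it pretends the record holds only the head1, head2, data, and foot fields shown earlier (so RECSIZE and the offsets are placeholders for your real layout), it assumes a BinObj whose data field is a plain Matrix{Float32}, and slicing the byte vector does copy each record’s bytes.
# Sketch only: simplified layout with just head1, head2, data, foot packed
# back to back; RECSIZE and the offsets are placeholders for the real struct.
const RECSIZE = 4 + 4 + 40*200*4 + 4   # bytes per record under this layout
function binobjs_from_bytes(filename)
    raw = read(filename)               # whole file as one Vector{UInt8}
    n = length(raw) ÷ RECSIZE
    objs = Vector{BinObj}(undef, n)
    for i in 1:n
        off = (i - 1) * RECSIZE
        head1 = reinterpret(UInt32, raw[off+1:off+4])[1]
        head2 = reinterpret(Float32, raw[off+5:off+8])[1]
        # copy the matrix bytes and reshape into a plain Matrix{Float32}
        data = reshape(collect(reinterpret(Float32, raw[off+9:off+8+40*200*4])), 40, 200)
        foot = reinterpret(UInt32, raw[off+RECSIZE-3:off+RECSIZE])[1]
        objs[i] = BinObj(head1, head2, data, foot)
    end
    return objs
end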
I have had a similar problem when trying to read a vector of Float32 values. That is obviously simpler than your compound struct but maybe the same technique would work?
I ended up using the read! function like so:
n = div(filesize(filename),sizeof(BinObj))
read!(filename,Vector{BinObj}(undef,n))
This calculates from the file size how many BinObj objects the file contains, so it can allocate a Vector of the correct length. The read! call then stuffs the binary data into that Vector directly.
I’m not sure if this will actually work with a compound struct but worth a try.
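For the plain Float32 case it was essentially the snippet below. One caveat: as far as I know, read! only fills a Vector{T} straight from the raw bytes when T is a bits type (true for the SMatrix-based BinObj, but not for one holding a Matrix); otherwise it falls back to per-element read(io, T) and hits the same error as before.
# Simpler case: a file that is just packed Float32 values.
n = div(filesize(filename), sizeof(Float32))
vals = read!(filename, Vector{Float32}(undef, n))
# Quick check before trying the same pattern on the compound struct:
isbitstype(BinObj)   # must be true for read! to fill a Vector{BinObj} in place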
@Jordan_Cluts Thanks, this solves my secondary question about skipping the reinterpret and not allocating more than I need.
@contradict Thanks, I will give that approach a try after my morning meetings
@contradict and @Jordan_Cluts Thank you for your help!
This is the current way I have approached this problem. I’m sure it can be cleaned up some more, but for the time being this seems to work:
struct BinObj
    head1::UInt32
    head2::Float32
    ⋮
    data::Matrix{Float32} # Has size (40, 200)
    foot::UInt32
end # Each record occupies BINOBJSIZE bytes in the file
function _readIntoBinObj(iostream)
    head1 = read(iostream, UInt32)
    head2 = read(iostream, Float32)
    ⋮
    data = read!(iostream, Matrix{Float32}(undef, 40, 200))
    foot = read(iostream, UInt32)
    return BinObj(head1, head2, ..., data, foot)
end
function getBinObjVector(filename)
    n = filesize(filename) ÷ BINOBJSIZE
    return open(filename) do io # open/close the file automatically
        [_readIntoBinObj(io) for i in 1:n]
    end
end
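In case it’s useful, here is a small untested variant that stops at end-of-file instead of precomputing the count, along the lines of the suggestion above (getBinObjVectorEOF is just a placeholder name):
# Variant sketch: keep calling _readIntoBinObj until EOF instead of
# computing the record count from BINOBJSIZE up front.
function getBinObjVectorEOF(filename)
    open(filename) do io
        objs = BinObj[]
        while !eof(io)
            push!(objs, _readIntoBinObj(io))
        end
        objs
    end
end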