Preferred method of loading binary files?

I am currently writing code to read a binary file format. The file has a fixed structure, I have structs that mirror it, and I would like to read the data from the file and fill my structs with it.

What is the preferred way of doing that? I can load the data directly into a struct by calling read! with a Ref{} to the struct, but that seems like it might fail catastrophically if Julia's in-memory layout of the struct contains padding bytes that the file does not.

Should I just loop through the fields and read the corresponding types from the file one at a time?

The best technique depends on the file format specification.

If you know that the file format is a direct dump of structs which follow the C ABI, then the padding will be in the file and you can just read it directly as you’re suggesting. For this to work, the file needs to be read and written on the same architecture and OS (same byte order and alignment rules), so it’s a little brittle.
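As a rough sketch of this first option: the Header struct below is made up for illustration and not taken from any real format. It shows how to check where Julia places padding with sizeof and fieldoffset, and how to read a whole isbits struct through a Ref. The buffer is filled by hand to imitate what a C program dumping the struct would write.

```julia
# Hypothetical header; Julia lays out isbits structs with C-compatible alignment.
struct Header
    tag::UInt8      # offset 0, followed by 3 padding bytes on common platforms
    size::UInt32    # offset 4
end

# Inspect the in-memory layout before trusting a direct read.
@show sizeof(Header)
@show [fieldoffset(Header, i) for i in 1:fieldcount(Header)]

# Build a fake "file" that contains the padding, as a raw struct dump would.
io = IOBuffer()
write(io, 0x07)              # tag
write(io, zeros(UInt8, 3))   # padding bytes
write(io, UInt32(1234))      # size, in native byte order
seekstart(io)

# Read the whole struct in one go; this only works for isbits types.
ref = Ref{Header}()
read!(io, ref)
header = ref[]
```

If the file had been written without the three padding bytes, the same read! would silently put the wrong bytes into size, which is exactly the failure mode the question worries about.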

If you know that all structs in the file were written without padding, I’d suggest just reading the fields sequentially one by one. After you’ve read the fields, pass them to the constructor for your struct type.
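A minimal sketch of the field-by-field approach, assuming a made-up Entry record stored packed and little-endian (both the struct and the byte order are assumptions for the example, not facts about any particular game format):

```julia
# Hypothetical record: 7 bytes on disk, although sizeof(Entry) is larger in memory.
struct Entry
    kind::UInt8
    offset::UInt32
    length::UInt16
end

# Read one packed Entry, converting from little-endian to host byte order.
function read_entry(io::IO)
    kind   = read(io, UInt8)
    offset = ltoh(read(io, UInt32))
    len    = ltoh(read(io, UInt16))
    return Entry(kind, offset, len)
end

# Usage with an in-memory buffer standing in for the file.
io = IOBuffer()
write(io, 0x01, htol(UInt32(4096)), htol(UInt16(32)))
seekstart(io)
entry = read_entry(io)
```

Because each field is read separately, the in-memory padding of Entry never matters; only the order and types of the reads have to match the file.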

This second option is very flexible, but it can be rather verbose if you need to implement it for many different structs. If that’s the case, you could consider some code generation using reflection facilities like fieldnames and fieldtype.
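One way the reflection idea might look, as a sketch rather than a finished solution: read_packed and Vertex are hypothetical names, the data is assumed to be little-endian, and every field is assumed to be a primitive number (nested structs, strings or arrays would need their own handling).

```julia
# Read any struct made of primitive numeric fields, one field at a time, no padding.
function read_packed(io::IO, ::Type{T}) where {T}
    values = Any[]
    for i in 1:fieldcount(T)
        # fieldtype(T, i) tells us what to read next; ltoh fixes the byte order.
        push!(values, ltoh(read(io, fieldtype(T, i))))
    end
    return T(values...)
end

struct Vertex   # hypothetical example type, 9 bytes on disk
    x::Float32
    y::Float32
    flags::UInt8
end

# Usage with an in-memory buffer standing in for the file.
io = IOBuffer()
write(io, htol(1.5f0), htol(-2.0f0), 0x03)
seekstart(io)
v = read_packed(io, Vertex)
```

The loop keeps the reads in field order and works for any such struct without extra code. It trades some speed for simplicity; a macro or generated function could unroll it if that ever matters.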

Which format are you trying to parse?


Thanks! I am trying to load some archive/map data of some old games for visualization, so those formats are obviously not well specified.

Guess I will just have a look at how they behave and write code accordingly. Since the main goal of all of this is to get into Julia a bit more, I would prefer a nice Julian solution over just hacking something together 🙂

Code generation based on fieldnames and fieldtypes seems like an elegant solution.

That makes sense; you’ll basically have to reverse engineer these formats then.

My suggestion would be to start by writing out the loader code explicitly as a long series of read(io, T) calls while you work on understanding the format. Once you start to see patterns, you’ll know whether it’s possible to attack it with a code generator or not. The reason I say this is that many ad hoc binary formats are rather “clever” in being optimized for file size, access time, or backward compatibility, and there can be a ton of special cases which are not well suited to code generation. YMMV 🙂
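To illustrate the kind of special case meant here, this is a sketch of an explicit loader for an entirely invented archive layout: a 4-byte magic string, a version, an entry count, and entries whose flags byte is assumed to exist only from version 2 onward. None of it comes from a real game format; it only shows why hand-written reads are easier to adapt while reverse engineering.

```julia
struct ArchiveEntry   # hypothetical directory entry
    offset::UInt32
    nbytes::UInt32
    flags::UInt8
end

function load_archive(io::IO)
    magic = read(io, 4)                      # first four bytes as a Vector{UInt8}
    magic == b"ARCH" || error("unexpected magic bytes")
    version = ltoh(read(io, UInt16))
    n       = ltoh(read(io, UInt16))         # number of entries that follow
    entries = ArchiveEntry[]
    for _ in 1:n
        offset = ltoh(read(io, UInt32))
        nbytes = ltoh(read(io, UInt32))
        # Special case: pretend the flags byte was only added in version 2.
        flags = version >= 2 ? read(io, UInt8) : 0x00
        push!(entries, ArchiveEntry(offset, nbytes, flags))
    end
    return version, entries
end

# Usage with a hand-built version 2 archive containing one entry.
io = IOBuffer()
write(io, "ARCH", htol(UInt16(2)), htol(UInt16(1)))
write(io, htol(UInt32(16)), htol(UInt32(100)), 0x80)
seekstart(io)
version, entries = load_archive(io)
```

A version-dependent field like this is trivial to express in explicit code but awkward for a generator driven purely by fieldtype, since the struct alone does not say when the field is present.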
