DRY reading binary structured data

hacklint · July 28, 2021, 8:10am

Hi,
I’m trying to create a Julia reader for a binary block-based file format and run into different issues depending on which approach I try. There are many different types of blocks, but essentially they look something like this:

using CBinding
import Base.read

testdata = IOBuffer([0x23, 0x23, 0x4d, 0x44, 0x00, 0x00, 0x00, 0x00,
                     0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                     0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f,
                     0x72, 0x6c, 0x64, 0x00])

@cstruct HEADER {
    id::UInt8[4]
    reserved::UInt8[4]
    length::UInt64
    link_count::UInt64
};

@cstruct MDBLOCK {
    header::HEADER
    # No links...    
    md_data::UInt8[]
};

function read(io::IO, MDBLOCK)
    header = read(io, HEADER)
    content = read(io, header.length)
    println("READ:" * String(content))
    MDBLOCK(header, content)
end

function read_md(io::IO)
    header = read(io, HEADER)
    content = read(io, header.length)
    println("READ:" * String(content))
    MDBLOCK(header, content)
end

# md = read(testdata, MDBLOCK)
seekstart(testdata)
md2 = read_md(testdata)

The header is always present, and may be followed by a set of links (essentially UInt64s) pointing to other blocks. The payload part may be either a fixed structure or, as above in md_data, varying in size. The varying-size payload is where I get stuck and would like some guidance.

So far I have tried a couple of approaches:
CBinding.jl was where I started, but with the post-1.0 move to pure C syntax I really felt it was going in the wrong direction, essentially re-creating the two-language problem that Julia aims to solve.

So I tried pinning CBinding.jl to pre-1.0, which makes it possible to use @cstruct to define my block types and to read them directly from an IOStream. That works fine for the static payload case where the complete block is specified at compile time. However, I get stuck when I need to overload read(io::IO, MDBLOCK). When I call it, dispatch selects the auto-generated read method from CBinding, not my new shiny one handling the variable-length part.
The auto-generated read does kind of work, as it reads the header part, but it leaves the payload unread. Is there a way to make my own read “more specific” so that it gets selected at dispatch?

julia> @which read(testdata, MDBLOCK)
read(io::IO, ::Type{CA}) where CA<:Caggregate in CBinding at /Users/klint/.julia/packages/CBinding/9dfDe/src/caggregate.jl:31

Next I tried renaming my read function to read_md and dropping the MDBLOCK type parameter. (Both versions are included in the minimal example.) This works as far as reading the variable-length payload, but complains that it does not find a matching constructor for MDBLOCK.

julia> 

READ:Hello world
ERROR: LoadError: MethodError: no method matching MDBLOCK(::HEADER, ::Vector{UInt8})
Closest candidates are:
  (::Type{CA})(::Union{typeof(zero), UndefInitializer, Cconst{CA, S} where S, Caggregate, CA}; kwargs...) where CA<:Caggregate at /Users/klint/.julia/packages/CBinding/9dfDe/src/caggregate.jl:15
Stacktrace:
 [1] read_md(io::IOBuffer)
   @ Main ~/proj/julia/mdf/minimal.jl:34
 [2] top-level scope
   @ ~/proj/julia/mdf/minimal.jl:39
in expression starting at /Users/klint/proj/julia/mdf/minimal.jl:39

julia>

So I tried biting the bullet and learning StaticArrays to define the structs, but unless I am mistaken that does not create readers for the structs I define. With CBinding.jl providing that as-is, I would really prefer not to repeat the structures by writing read methods that explicitly read each field separately (as was suggested in a post by @c42f here).

Am I missing something obvious or am I just asking for too much of Julia? It feels like this should not be too hard a problem.
As in the article linked above, I’m learning and would like a “Julian” solution. I of course realise that pinning CBinding to an old version is far from ideal. From where I stand and with what I know right now it looks like the most desirable solution, but that may well be because I have dug myself into a hole.

Concrete questions, in order of decreasing frustration but increasing importance:

Is there a way to make dispatch pick my read function, and if so, how?
Does it make sense to try to build something potentially useful on an old CBinding version?
Is there a better way forward, e.g. just accepting the need to duplicate info by reading a field at a time?

Thanks!

c42f · July 29, 2021, 6:53am

You need to be sure to add your method to the correct read function. It’s probably as simple as adding a Base. prefix as in

function Base.read(io::IO, MDBLOCK)
    # ...
end

At least, I’m guessing this is your problem — the read function you defined had the same name, but was technically a different function not sharing a method table with Base.read.

Building on the old CBinding version should be reliable enough, since the old version isn’t going away. But obviously you won’t benefit from new features or bugfixes. I’d be inclined to use the new version.

Julia’s type introspection and metaprogramming facilities are more than sufficient to autogenerate a maximally efficient read() for an arbitrary Julia struct (see for example fieldnames and fieldtypes), provided you know the detailed serialization rules. The problem is that serialization rules (padding, endianness, etc.) can be complicated and platform dependent, or might even have been hand-rolled on the C side. This is often where the complexity comes in, especially for, e.g., packetized binary formats emitted by embedded systems.
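As a small illustration of the introspection mentioned above, here is a sketch using only Base functions. The Header struct is a hypothetical plain-Julia stand-in for the HEADER block from the first post, and the little-endian byte order is an assumption about the file format, not something the thread confirms.

```julia
# Hypothetical plain-Julia mirror of the HEADER layout from the first post.
struct Header
    id::NTuple{4,UInt8}
    reserved::NTuple{4,UInt8}
    length::UInt64
    link_count::UInt64
end

# Field names and types can be queried from the type itself.
field_names = fieldnames(Header)   # (:id, :reserved, :length, :link_count)
field_types = fieldtypes(Header)

# Size of the struct if the fields are stored back to back with no padding.
packed_size = sum(sizeof, field_types)   # 4 + 4 + 8 + 8 = 24

# Byte order has to be handled explicitly: ltoh converts a value stored as
# little-endian (an assumption here) to the host's byte order.
len = ltoh(read(IOBuffer(UInt8[0x0c, 0, 0, 0, 0, 0, 0, 0]), UInt64))
```

For this layout the packed size happens to equal sizeof(Header), but that will not hold in general once fields need alignment padding.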

hacklint · July 29, 2021, 9:01am

c42f:

hacklint:

Is there a way / how to make dispatch pick my read function?

You need to be sure to add your method to the correct read function. It’s probably as simple as adding a Base. prefix as in
function Base.read(io::IO, MDBLOCK)
    # ...
end
At least, I’m guessing this is your problem — the read function you defined had the same name, but was technically a different function not sharing a method table with Base.read.

It seems that what I needed was to use the type selector ::Type{MDBLOCK} rather than a bare MDBLOCK, which only names an untyped argument:

function Base.read(io::IO, ::Type{MDBLOCK})

Prefixing with Base. was not strictly needed (since I import Base.read?), but it may be better to include it to be really clear about the intention.
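To spell out the difference, here is a minimal sketch that needs no packages. Block, Plain, Special and describe are hypothetical names: Block stands in for CBinding's Caggregate, and the first describe method stands in for the auto-generated read.

```julia
abstract type Block end
struct Plain <: Block end
struct Special <: Block end

# Generic method, similar in shape to the auto-generated read.
describe(io::IO, ::Type{T}) where {T<:Block} = "generic"

# Here `Special` is only the *name* of an untyped argument, so this method
# matches any second argument and is less specific than the one above.
describe(io::IO, Special) = "untyped"

io = IOBuffer()
before = describe(io, Special)      # the generic method still wins

# Constraining the argument to the type object itself is more specific
# than the generic `Type{T} where T<:Block` method.
describe(io::IO, ::Type{Special}) = "specific"

after = describe(io, Special)       # now the new method is selected
other = describe(io, Plain)         # other subtypes still use the generic one
```

The untyped method is only ever reached for arguments the other methods do not cover, for example describe(io, "some string").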

So now I’m back at trying to make the MDBLOCK constructor work for a short-term solution.

I did have a quick look at the CBinding code for read, and yes, that seems like a sensible approach. In this case the format is a very well defined and standardised file format, so the packing and ordering should at least be predictable and stable. Given all that, it may make most sense to pursue a StaticArrays and introspection solution.
Is there a good example in the codebase for using introspection to create a read method for a type with nested structs?

c42f · July 29, 2021, 10:50am

Ah yes, I missed the import

I usually use the explicit form myself.

I haven’t tried to do this kind of thing for a while, so there may be a useful package for it. But in case you want to roll it yourself, here’s a proof of concept.

# A wrapper to distinguish your custom serialization from other forms.
# Maybe unnecessary, but allows us to define a read method for all structs
# specific to your serialization, rather than a specific serialization for
# particular structs.
struct MySerialization{IOT}
    io::IOT
end

# Helper function to construct a type from its fields
function _construct(T, vals...)
    T(vals...)
end
function _construct(::Type{T}, vals...) where {T<:Tuple}
    tuple(vals...)::T   # Assumes vals are of correct type (the type assert should be unnecessary)
end

# Generated function to do the actual reading.
# Likely this could be written in more functional form to avoid the `@generated`, but a quick try
# at that didn't generate good code for me.
@generated function Base.read(ser::MySerialization, ::Type{T}) where {T}
    if isstructtype(T)
        ts = fieldtypes(T)
        vars = [Symbol("x$i") for i in 1:length(ts)]
        # Basic in-order, packed serialization - assumes no padding between fields.
        reads = [:($var = read(ser, $t)) for (var,t) in zip(vars,ts)]

        quote
            $(reads...)
            _construct(T, $(vars...))
        end
    else
        # Fall back to read(::IO, T) for non-structs
        quote
            read(ser.io, T)
        end
    end
end

Here’s some example nested types to test with

struct A
    x::Cint
    buf::NTuple{4,Cchar}
end

struct X
    a::A
    b::A
end

# Manually serialize some stuff
io = IOBuffer()
write(io, Cint(0))
write(io, Cchar(1))
write(io, Cchar(2))
write(io, Cchar(3))
write(io, Cchar(4))
write(io, Cint(10))
write(io, Cchar(11))
write(io, Cchar(12))
write(io, Cchar(13))
write(io, Cchar(14))
seek(io,0)

Trying it out:

julia> ser = MySerialization(io);

julia> x = read(ser, X)
X(A(0, (1, 2, 3, 4)), A(10, (11, 12, 13, 14)))

Check that the generated code is what we expect:

julia> @code_lowered read(ser, X)
CodeInfo(
    @ /home/chris/test.jl:13 within `read'
   ┌ @ /home/chris/test.jl:21 within `macro expansion'
1 ─│      x1 = Main.read(ser, A)
│  │      x2 = Main.read(ser, A)
│  └
│  ┌ @ /home/chris/test.jl:22 within `macro expansion'
│  │ %3 = Main._construct($(Expr(:static_parameter, 1)), x1, x2)
└──│      return %3
   └
)

julia> @code_lowered read(ser, A)
CodeInfo(
    @ /home/chris/test.jl:13 within `read'
   ┌ @ /home/chris/test.jl:21 within `macro expansion'
1 ─│      x1 = Main.read(ser, Int32)
│  │      x2 = Main.read(ser, NTuple{4, Int8})
│  └
│  ┌ @ /home/chris/test.jl:22 within `macro expansion'
│  │ %3 = Main._construct($(Expr(:static_parameter, 1)), x1, x2)
└──│      return %3
   └
)

c42f · July 29, 2021, 10:52am

By the way, the above mess with @generated assumes you’re actually looking for maximum performance comparable to a hand-unrolled version. If you’d rather just have code simplicity, you could go with something like the following read implementation instead:

function Base.read(ser::MySerialization, ::Type{T}) where {T}
    if isstructtype(T)
        ts = fieldtypes(T)
        vals = read.(Ref(ser), ts)
        _construct(T, vals...)
    else
        # Fall back to read(::IO, T) for non-structs
        read(ser.io, T)
    end
end

hacklint · July 29, 2021, 11:05am

The “code simplicity” version looks a lot like the CBindings stuff, so that’s what I’ll try.
Thanks a lot!

c42f · July 30, 2021, 5:08am

The “code simplicity” version […] that’s what I’ll try.

I had a gut feeling that this version is quite inefficient, so I measured. It turns out it’s over 200x slower for reading X! On my machine, reading X from an IOBuffer takes ~35 ns with the @generated version vs ~9700 ns with the simpler version.

So if you care about performance at all, I’d suggest the @generated function version.

I also had another go at making a version in more functional style which is both fast and simple. So far I got to the following which is at least type-stable, but it’s still massively slower (around 30x slower) than the @generated function:

function Base.read(ser::MySerialization, ::Type{T}) where {T}
    if isstructtype(T)
        fields = ntuple(fieldcount(T)) do i
            read(ser, fieldtype(T, i))
        end
        _construct(T, fields...)
    else
        # Fall back to read(::IO, T) for non-structs
        read(ser.io, T)
    end
end

aplavin · July 30, 2021, 11:41am

I took a similar approach when reading SquashFS files:

Structs defined with CBinding.@cstruct and FlagSets (src/sqfs_structs.jl · master · Alexander Plavin / SquashFS.jl · GitLab)
A common generated function to read all fixed-size fields: src/utils.jl · master · Alexander Plavin / SquashFS.jl · GitLab
Custom code to read remaining fields of varying sizes, e.g. src/sqfs_structs.jl · master · Alexander Plavin / SquashFS.jl · GitLab.

Developed that with a pre-1.0 CBinding version. As you say, they later changed/removed those struct definition macros, but of course everything continues to work fine with older versions.
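A rough sketch of that split (a generic reader for the fixed-size part plus custom code for the varying part), written with plain Julia structs rather than CBinding. Header, MDBlock and read_fixed are hypothetical names, and the sketch assumes packed fields stored in the host's byte order (little-endian on most machines).

```julia
struct Header
    id::NTuple{4,UInt8}
    reserved::NTuple{4,UInt8}
    length::UInt64
    link_count::UInt64
end

struct MDBlock
    header::Header
    md_data::Vector{UInt8}      # varying size, so not part of the generic read
end

# Generic reader for fixed-size types: read each field in declaration order.
# Non-struct types (UInt8, UInt64, ...) fall through to Base's read.
function read_fixed(io::IO, ::Type{T}) where {T}
    isstructtype(T) || return read(io, T)
    vals = ntuple(i -> read_fixed(io, fieldtype(T, i)), fieldcount(T))
    return T <: Tuple ? vals : T(vals...)
end

# Custom code for the block with a variable-length payload.
function Base.read(io::IO, ::Type{MDBlock})
    header = read_fixed(io, Header)
    return MDBlock(header, read(io, Int(header.length)))
end

# The test data from the first post.
bytes = UInt8[0x23, 0x23, 0x4d, 0x44, 0x00, 0x00, 0x00, 0x00,
              0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
              0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
              0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f,
              0x72, 0x6c, 0x64, 0x00]
md = read(IOBuffer(bytes), MDBlock)
```

Because MDBlock is an ordinary struct, its default constructor accepts the header and the payload vector directly. This simple read_fixed is not tuned for speed.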

hacklint · July 30, 2021, 11:08pm

Seems I’m in good company going down the common-fixed / custom-varying road then.

Thanks for looking into the performance; I will look at the @generated stuff once I get all the basic functionality in place. Right now function is more important than speed, but we all know how fast the need for speed builds once things work…

I added a dict mapping the header ids to “real” block types, so now I read a header, get the id, map it to the real type, seek back and read the whole block without knowing in advance what it will be. I will see if I can get rid of the duplicated header read… This makes adding new blocks a breeze. I’m also very happy with how easy it was to get useful output working by implementing show for the header.

I did drop the CBinding.@cstruct stuff in favour of StaticArrays, as I learned above how to automate the reading (it feels a tiny bit cleaner and avoids my issues with calling the generated constructors).
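For readers following along, the id-to-type lookup described above could look roughly like the sketch below. All names (MDBlock, UnknownBlock, read_header, read_block, read_any_block, BLOCK_TYPES) are hypothetical, and the header layout (4-byte id, 4 reserved bytes, then payload length and link count as little-endian UInt64) is assumed from the test data in the first post.

```julia
# Hypothetical block types; a real format would have many more.
struct MDBlock
    text::String
end
struct UnknownBlock
    id::String
    payload::Vector{UInt8}
end

# Read the assumed header layout into a named tuple.
function read_header(io::IO)
    id = String(read(io, 4))
    skip(io, 4)                               # reserved bytes
    len = Int(ltoh(read(io, UInt64)))
    nlinks = Int(ltoh(read(io, UInt64)))
    return (id = id, length = len, link_count = nlinks)
end

function read_block(io::IO, ::Type{MDBlock})
    h = read_header(io)
    skip(io, 8 * h.link_count)                # links are UInt64 values
    bytes = read(io, h.length)
    return MDBlock(String(rstrip(String(bytes), '\0')))
end

function read_block(io::IO, ::Type{UnknownBlock})
    h = read_header(io)
    skip(io, 8 * h.link_count)
    return UnknownBlock(h.id, read(io, h.length))
end

# Map header ids to the type that knows how to parse the block.
const BLOCK_TYPES = Dict{String,DataType}("##MD" => MDBlock)

function read_any_block(io::IO)
    start = position(io)
    h = read_header(io)                       # look at the id only
    seek(io, start)                           # rewind to the start of the block
    return read_block(io, get(BLOCK_TYPES, h.id, UnknownBlock))
end

bytes = UInt8[0x23, 0x23, 0x4d, 0x44, 0x00, 0x00, 0x00, 0x00,
              0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
              0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
              0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f,
              0x72, 0x6c, 0x64, 0x00]
blk = read_any_block(IOBuffer(bytes))
```

Passing the already-read header on to read_block, instead of seeking back, would be one way to avoid the duplicated header read.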
