Efficiently interpreting byte-packed buffer

Also, for anyone like myself who is new to generated functions: instead of writing an extra function to recover the type, the @generated macro was fine with using a parameterized function signature to get T:

@generated function struct_at_index2(::Type{T}, buf::Vector{UInt8}, i) where {T}

CBinding.jl also works well for these use cases, especially when there is a C header/type definition of the data or when a different alignment/packing is used. It might even perform better…

julia> using CBinding

       c``

       c"""
       #include <stdint.h>
       """s

       const c"uint16_t" = UInt16
       const c"uint32_t" = UInt32

       c"""
       struct Tlm {
           uint16_t a;
           uint32_t b, c, d, e, f, g, h, i, j, k;
       } __attribute__((packed));
       """j

julia> sizeof(Tlm)
42
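(That checks out: 2 bytes for the uint16_t plus 10 × 4 bytes for the uint32_t fields, with no padding thanks to __attribute__((packed)); without the packing attribute the struct would be padded to 44 bytes.)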

julia> function testfunc(ptr::Cptr{Tlm})
           sum(1:100) do i
               ptr[i].a[] + ptr[i].d[]
           end
       end

julia> buf = rand(UInt8, 42*100);

julia> @btime testfunc($(Cptr{Tlm}(pointer(buf))))
  56.697 ns (0 allocations: 0 bytes)
0x00000030a830eaa5

julia> @code_native testfunc(Cptr{Tlm}(pointer(buf)))
        .text
; ┌ @ REPL[11]:1 within `testfunc'
        pushq   %rax
; │ @ REPL[11]:2 within `testfunc'
        movq    %rdi, (%rsp)
; │┌ @ reducedim.jl:874 within `sum'
; ││┌ @ reducedim.jl:874 within `#sum#680'
; │││┌ @ reducedim.jl:878 within `_sum'
; ││││┌ @ reducedim.jl:878 within `#_sum#682'
; │││││┌ @ reducedim.jl:310 within `mapreduce'
; ││││││┌ @ reducedim.jl:310 within `#mapreduce#672'
; │││││││┌ @ reducedim.jl:318 within `_mapreduce_dim'
        movabsq $.rodata.cst16, %rsi
        movabsq $_mapreduce, %rax
        movq    %rsp, %rdi
        callq   *%rax
; │└└└└└└└
        popq    %rcx
        retq
; └


I like keeping it simple. However, I am trying to keep things generic: I have many telemetry types, and for code hygiene I don’t want to write a function that requires manually specifying all of the parameters and types again (I only want to do that once, at the struct definition). That said, maybe there is a way to write this generically that gets the original performance (which was on par with the generated-function method). I’ll give it a try.

Thanks!

Let me introduce you all to the magical wonders of @nexprs:

julia> function example(buf)
           a = read(buf, UInt16)
           Base.Cartesian.@nexprs 9 i -> (a_i = read(buf, UInt32))
       (a, a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8, a_9)
       end
example (generic function with 1 method)

julia> example(IOBuffer([0x00 for _ in 1:100]))
(0x0000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000)

julia> using BenchmarkTools

julia> @benchmark example(io) setup=(io=IOBuffer([0x00 for _ in 1:100])) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.000 ns …  1.765 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.773 ns ± 29.007 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▇██▇▅▄▃▃▃▃▂▂▂▂▂▁▁▂▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁ ▁                    ▂
  ████████████████████████████████████████▇▇▇▆▇▇▇▆▆▇▆▇▆▆▆▅▆▅▅ █
  40 ns        Histogram: log(frequency) by time       102 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The only downside is that you have to specify the number of variables for @nexprs as a literal, so there is no automatic generation (well, aside from inside a generated function…).
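To make the unrolling visible, here is roughly what the expansion looks like (an illustrative @macroexpand session; the printed form may differ slightly across Julia versions):

julia> using Base.Cartesian

julia> @macroexpand @nexprs 3 i -> (a_i = read(buf, UInt32))
quote
    a_1 = read(buf, UInt32)
    a_2 = read(buf, UInt32)
    a_3 = read(buf, UInt32)
end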

As an aside, you won’t get around manually writing the reinterpretation in quite a few cases anyway, since all this reinterpret business is really only kosher for isbits types, i.e. no pointers. For those you will have to go through constructors.
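For example (Flat and Boxed are made-up types, just to illustrate the isbits distinction):

struct Flat          # only plain-bits fields: fine to reinterpret
    a::UInt16
    b::UInt32
end

struct Boxed         # a String field is a pointer under the hood
    a::UInt16
    s::String
end

isbitstype(Flat)   # true
isbitstype(Boxed)  # false (such types have to go through constructors)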


Ah yeah, that looks better; I was kind of noodling my way through there.

The answer is in this post Constraints for `@generated` function - #3 by Chris_Foster

The generated function is not allowed to introduce a new type, and that’s what happens with a closure.
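A minimal sketch of what trips this up (hypothetical function name; the exact error text may vary across Julia versions):

@generated function unpack_closure(::Type{T}, buffer) where T
    # The `do` block creates a closure (a new type) inside the generator
    # body, which is exactly what is not allowed:
    exprs = map(1:fieldcount(T)) do i
        :(read(buffer, $(fieldtype(T, i))))
    end
    return :($T($(exprs...)))
end

# Calling it errors with something like:
# ERROR: The function body AST defined by this @generated function is not
# pure. This likely means it contains a closure, a comprehension or a generator.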


My macro-fu is weak, but here we go:

using BenchmarkTools

function readstruct(T, buffer)
    fns = fieldnames(getfield(@__MODULE__, T))
    fts = fieldtypes(getfield(@__MODULE__, T))
    stmts = []
    # one `read` per field, e.g. `a = read(buffer, UInt16)`
    for (fn, ft) in zip(fns, fts)
        push!(stmts, :($(esc(fn)) = read($(esc(buffer)), $ft)))
    end
    args = [:($(esc(fn))) for fn in fns]
    # the final statement constructs the struct; `push!` returns `stmts`
    push!(stmts, :($T($(args...))))
end

macro readstruct(T, buffer)
    stmts = readstruct(T, buffer)
    quote $(stmts...) end
end

struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

function TLM(buffer::IOBuffer)
    # @show @macroexpand(@readstruct TLM buffer)
    @readstruct TLM buffer
end

function unpack(::Type{T}, buffer::IOBuffer, buffer_lock) where {T}
    newT = lock(buffer_lock) do
        T(buffer)
    end
    return newT
end

@btime unpack(TLM, IOBuffer(buffer), lock) setup=(
    buffer = zeros(UInt8, 42);
    lock = ReentrantLock())

yielding

  74.097 ns (1 allocation: 64 bytes)

Critique welcome!

Edit: for everyone asking (like me) about the difference between IOBuffer’s read and reinterpret.

And since we opened the gates to the paradise of Base.Cartesian, let’s go all the way in:

function unpack(::Type{T}, buffer::IOBuffer, buffer_lock) where T
    return lock(buffer_lock) do
        unsafe_unpack(T, buffer)
    end
end

@generated function unsafe_unpack(::Type{T}, buffer) where T
    ftypes = fieldtypes(T)
    l = length(ftypes)
    return :(begin
        Base.Cartesian.@nexprs $l i -> (a_i = read(buffer, $(ftypes)[i]))
        Base.Cartesian.@ncall $l $T a
    end)
end

@nhardy this should be general enough for all isbits types :wink:

struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

using BenchmarkTools

@benchmark unpack(TLM, IOBuffer(buffer), lock) setup=(
           buffer=zeros(UInt8, 42);
           lock=ReentrantLock())

gives me

101.729 ns (1 allocation: 64 bytes)

That’s consistently ~10% faster (at least on my laptop :P), in both minimum and median times, than the slightly more complicated version by @goerch.

However, I particularly like the version by @krrutkow for its extremely short @code_native. What we get here, while fast, is not pretty :stuck_out_tongue:


Hi @jules, @abulak,

Being relatively new to the world of metaprogramming in Julia, what is your advice on when to switch from a macro to a generated function? I’ve seen some older advice about unrolling to work around compiler weaknesses, as well as advice to avoid generated functions. Is there a current rule of thumb?

Thanks in advance!

Thanks again to the Julia community; these are some fantastic suggestions. I had never come across generated functions before this. The initial suggestion from @jules got me over the hump, but I am going to implement some of the others for comparison. I especially like @abulak’s suggestion: the combination of IOBuffer reads, @nexprs, and @generated gives concise, clean code that I have a hope of understanding when I inevitably revisit it in 3 years.

I would say that @generated is exactly for cases like this one here. We can’t easily achieve type stability with normal functions, because we have to loop over differently typed fields of a struct, and as we could see in the first attempts, the intermediate Tuples and splatting lead to unnecessary overhead. This pattern, where looping over a small but heterogeneous collection is type unstable, is usually what leads to manual unrolling.
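For concreteness, the naive loop version (a hypothetical sketch, not from this thread) is inherently type unstable, because each iteration reads a different field type:

function naive_unpack(::Type{T}, buffer) where T
    vals = Any[]                                    # element type can't be inferred
    for i in 1:fieldcount(T)
        push!(vals, read(buffer, fieldtype(T, i)))  # a different type each iteration
    end
    return T(vals...)                               # splatting a Vector{Any} allocates
end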

You can’t easily use simple macros here, because the code we want to write depends on the input types of the function. Macros are for cases where the code you want to write depends only on some expression. (You could eval stuff in a macro to approximate what @generated does, but that’s very much not advisable; macros should be pure syntax transformations.)
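To make that concrete, a toy macro (made up for illustration) only ever receives the unevaluated expression; there is no way to ask for the runtime type of buf at expansion time:

macro sees_only_syntax(ex)
    # `ex` is just an Expr here, e.g. :(read(buf, UInt16));
    # the runtime type of `buf` is invisible at this point.
    println("expanding: ", ex)
    return esc(ex)
end

# @sees_only_syntax read(buf, UInt16)
# prints: expanding: read(buf, UInt16)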

So here the idea is just to write out the constructor as we would manually do, so the compiler has all information directly available.

I didn’t know the macros from Base.Cartesian yet; they help avoid writing out the expressions by hand like I did.


You can avoid the whole @generated issue like this:

function unsafe_unpack(::Type{T}, buffer) where T
    fieldvals = ntuple(i -> read(buffer, fieldtype(T, i)), Val(fieldcount(T)))
    return T(fieldvals...)
end

This gives me the same performance as the @generated code above, and is quite a bit more transparent.
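Usage mirrors the earlier versions (a sketch; TLM as defined above):

buf = IOBuffer(zeros(UInt8, 42))
tlm = unsafe_unpack(TLM, buf)   # reads the fields in declaration order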


And wouldn’t you know it, the specific version with Val that you’re using here is also doing something generated: https://github.com/JuliaLang/julia/blob/aa497b579fdfcd96608e5753f2806041424a3cba/base/ntuple.jl#L69

Pretty much the same thing as in one of the solutions above. I would argue it’s not necessarily super transparent that you can do this with Val, although the code is of course shorter if the complexity of the inner function is not visible at the surface.

In general I would say that it can be hard to predict with such functions which one’s going to be inferred and compiled best, and it’s honestly a little bit of trial and error.


That’s what macros are for. I’m surprised that nobody has proposed a macro that acts on the struct definition yet, so let’s add one to the mix. This can absolutely be improved with respect to robustness and safety, but the performance should be fine.

macro magic(ex)
    # Build an inner constructor that reads each field out of a byte buffer.
    inner_constructor = quote
        function $(ex.args[2])(buffer::Vector{UInt8})
            new()
        end
    end

    # Dig down to the argument list of the `new()` call (it sits after the
    # LineNumberNodes in the constructor body) so we can append one
    # `unsafe_load` per field.
    new_args = inner_constructor.args[2].args[2].args[3].args
    i = 1  # 1-based byte offset of the current field within the buffer
    for field in ex.args[3].args
        if field isa Expr  # skip LineNumberNodes; fields are `name::Type` exprs
            data_type = field.args[2]
            push!(new_args,
                  :(unsafe_load(Ptr{$(data_type)}(pointer(buffer, $i)), 1)))
            i += sizeof(getfield(Main, data_type))
        end
    end
    # Splice the inner constructor into the struct body and return the
    # modified definition.
    push!(ex.args[3].args, inner_constructor)
    esc(ex)
end

@magic struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

buffer = rand(UInt8, 42)
tlm = TLM(buffer)

I meant that it is easier to understand what the code is doing. Predicting performance is of course hard, as it often is.

Yes, the clarity of the ntuple solution is much better than that of my generated version; still, I was surprised that it performed as well, and only understood why after going to the source. That’s what I meant by “not necessarily super transparent”.


What I find additionally interesting is that the ntuple function doesn’t use the @generated macro like we normally do; it uses the zero-argument version, which just does this:

macro generated()
    return Expr(:generated)
end

And the source code of @generated(f) itself is this:

macro generated(f)
    if isa(f, Expr) && (f.head === :function || is_short_function_def(f))
        body = f.args[2]
        lno = body.args[1]
        return Expr(:escape,
                    Expr(f.head, f.args[1],
                         Expr(:block,
                              lno,
                              Expr(:if, Expr(:generated),
                                   body,
                                   Expr(:block,
                                        Expr(:meta, :generated_only),
                                        Expr(:return, nothing))))))
    else
        error("invalid syntax; @generated must be used with a function definition")
    end
end

So ntuple kind of uses part of this machinery, this Expr(:if, Expr(:generated), ...), which I don’t quite understand. Maybe someone knows what Julia actually does with it?

Nice solution! You don’t even need ntuple then, and can write the original unpack like this:

function unpack(::Type{T}, buffer::IOBuffer, buffer_lock) where {T}
    newT = lock(buffer_lock) do
        T((read(buffer, fieldtype(T, i)) for i in 1:fieldcount(T))...)
    end
    return newT
end

@btime unpack(TLM, IOBuffer(buffer), lock) setup=(
    buffer = zeros(UInt8, 42);
    lock = ReentrantLock())

yielding

  74.691 ns (1 allocation: 64 bytes)

with one allocation for the IOBuffer remaining…


That’s the syntax for optionally generated functions, which allows one to provide an alternative non-@generated implementation and lets the compiler decide which one to use.
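For reference, here is that pattern applied to the unpacking problem in this thread (unsafe_unpack2 is a made-up name; the if @generated structure follows the Julia manual’s example):

function unsafe_unpack2(::Type{T}, buffer) where T
    if @generated
        # Generation-time branch: build the unrolled constructor call.
        # (No closures or comprehensions allowed here, hence the plain loop.)
        args = Expr[]
        for ft in fieldtypes(T)
            push!(args, :(read(buffer, $ft)))
        end
        return :($T($(args...)))
    else
        # Ordinary runtime fallback; the compiler may choose this instead.
        vals = Any[]
        for i in 1:fieldcount(T)
            push!(vals, read(buffer, fieldtype(T, i)))
        end
        return T(vals...)
    end
end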


Huh. So the ntuple solution from @DNF makes sense (it uses @generated under the hood), but I’m surprised the straight tuple solution from @goerch works. Regardless, and interestingly, I found that both approaches work when iterating over the field count and calling fieldtype(T, i) inside the iteration, but if you do what I originally did and first get the fieldtypes and then reference them, it is inefficient (extra allocations and slower). That is,

T((read(buffer, fieldtype(T, i)) for i in 1:fieldcount(T))...)

is performant, but

fts = fieldtypes(T)
T((read(buffer, fts[i]) for i in 1:fieldcount(T))...)

and

fts = fieldtypes(T)
T((read(buffer, ft) for ft in fts)...)

are not. It is worth noting that this is not the case in the explicitly generated versions, likely because the fieldtypes call happens in the @generated block and each fieldtype is made explicit in the code generation / unrolling from @nexprs.

I am not sure if I prefer these simple tuple/ntuple versions or the @generated versions. Simple is great, but it seems like they rely on hidden compiler optimizations that, for now, mirror the explicitly generated version, yet feel a little fragile. Thoughts? I feel like the generated version is more likely to produce the same performance down the road (when I inevitably need to change something), even if the syntax is a little harder to parse.

Thanks again everyone for all of the great input.