Efficiently interpreting byte-packed buffer

tl;dr

  1. Is there a built-in way to interpret (or instantiate from) a byte-packed buffer to a Julia struct?
  2. If not, is there an efficient, non-allocating method to instantiate a new struct from the buffer data?

I am replacing (and trying to add significant functionality to) some legacy MATLAB and Python code that reads TCP telemetry from some instrumentation. The messages are byte-packed (which is out of my power to change). It was straightforward to read the incoming packets into a byte array, but I am struggling to make that data quickly readable by downstream consumers.

The read-in buffer will be rapidly updated, and a number of services will access the data asynchronously. Each acquires a lock before doing so (which blocks buffer updates), so I want the access to be as fast as possible. My initial hope was to just use reinterpret to view the data as a Julia struct (making access simple), but because the data is byte-packed that does not generally work. For instance, when trying to interpret a buffer as

struct Tlm 
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

the incoming message has 42 bytes, but the Julia struct has padding that makes it 44 bytes (after searching through the forums I think I understand why now, but it's still a blocker).
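
For reference, the padding is visible directly with sizeof and fieldoffset:

julia> sizeof(Tlm)          # 2 bytes (a) + 2 padding bytes + 10 × 4 = 44
44

julia> fieldoffset(Tlm, 2)  # field b starts at byte offset 4, not 2
0x0000000000000004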

Given this, I have pursued two paths:

  1. Reinterpret each field independently and then make the object using the constructor
  2. Make a wrapper structure that uses property accesses to reinterpret individual fields whenever requested (sketched below)
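
Roughly, what I mean by option 2 (a hypothetical sketch, with the first two field offsets hard-coded for illustration):

struct TlmView
    buf::Vector{UInt8}
end

function Base.getproperty(v::TlmView, name::Symbol)
    buf = getfield(v, :buf)  # getfield bypasses getproperty to reach the raw bytes
    name === :a && return reinterpret(UInt16, view(buf, 1:2))[1]
    name === :b && return reinterpret(UInt32, view(buf, 3:6))[1]
    # ... one branch per remaining packed field ...
    error("TlmView has no field $name")
end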

Option 2 reproduces a lot of the functionality of structures and has some added complications, so I am leaning away from it. For option 1 the code is simple, but it makes dozens of allocations and takes ~10 μs in my initial attempt, for which an MWE is

function unpack(::Type{T}, buffer::Vector{UInt8}, buffer_lock) where {T}
    N = fieldcount(T)
    ft = fieldtypes(T)
    nbs = sizeof.(ft)
    idx2 = cumsum(nbs)
    idx1 = (0, idx2[1:end-1]...)
    idx1 = idx1 .+ 1
    newT = lock(buffer_lock) do
        T(Tuple(reinterpret(ft[i], buffer[idx1[i]:idx2[i]])[1] for i in 1:N)...)
    end
    return newT
end

I haven’t been able to construct the object without first collecting the arguments with Tuple and splatting them back out; profiling indicates there are a lot of inference calls there (and maybe the tuple is getting put on the heap?). Either way, each unpack takes ~10 μs and makes over 50 allocations totaling several kilobytes. Given the way it will be used, this is on the boundary of acceptable, but it seems far less efficient than it should be. So, that brings me to my questions. Is there an efficient way to either interpret a byte-packed buffer as a Julia struct, or at least to efficiently convert it? More specifically, for my MWE: is there a way to construct an object from a generator, or in some other way avoid collecting the arguments and splatting them back out again?

Thanks!

I tried some micro optimizations

using BenchmarkTools

struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

function TLM(itr)
    a, b, c, d, e, f, g, h, i, j, k = itr
    TLM(a, b, c, d, e, f, g, h, i, j, k)
end

function unpack(::Type{T}, buffer::Vector{UInt8}, buffer_lock) where {T}
    N = fieldcount(T)
    ft = fieldtypes(T)
    idx2 = cumsum(sizeof.(ft))
    idx1 = (1, (idx2[1:end-1] .+ 1)...)
    newT = lock(buffer_lock) do
        T(reinterpret(ft[i], buffer[idx1[i]:idx2[i]])[1] for i in 1:N)
    end
    return newT
end

struct Descriptor{FT, TI}
    T::Type
    N::Int
    ft::FT
    idx1::TI
    idx2::TI
end

function Descriptor(::Type{T}) where {T}
    N = fieldcount(T)
    ft = fieldtypes(T)
    idx2 = cumsum(sizeof.(ft))
    idx1 = (1, (idx2[1:end-1] .+ 1)...)
    Descriptor(T, N, ft, idx1, idx2)
end 

function _reinterpret(::Type{T}, buffer, idx1, idx2) where {T}
    if T == UInt16 || T == UInt32
        # Accumulate the bytes little-endian (least significant byte first) so
        # the result matches what `reinterpret` produces on little-endian hardware.
        result = zero(T)
        shift = 0
        for idx in idx1:idx2
            result |= T(buffer[idx]) << shift
            shift += 8
        end
        return result
    else
        @assert false
    end
end

function unpack(desc::Descriptor, buffer::Vector{UInt8}, buffer_lock, opt)
    newT = lock(buffer_lock) do
        if !opt
            desc.T(reinterpret(desc.ft[i], buffer[desc.idx1[i]:desc.idx2[i]])[1] for i in 1:desc.N)
        else
            desc.T(_reinterpret(desc.ft[i], buffer, desc.idx1[i], desc.idx2[i]) for i in 1:desc.N)
        end
    end
    return newT
end

@btime unpack(TLM, buffer, lock) setup=(
    buffer = zeros(UInt8, 42);
    lock = ReentrantLock())

@btime unpack(desc, buffer, lock, false) setup=(
    desc = Descriptor(TLM);
    buffer = zeros(UInt8, 42);
    lock = ReentrantLock())
    
@btime unpack(desc, buffer, lock, true) setup=(
    desc = Descriptor(TLM);
    buffer = zeros(UInt8, 42);
    lock = ReentrantLock())    

yielding

  5.114 μs (38 allocations: 2.62 KiB)  # original unpack
  2.656 μs (24 allocations: 1.42 KiB)  # splitting reflection and reinterpretation
  1.860 μs (2 allocations: 400 bytes)  # + specialized reinterpretation

Not sure if we can get this completely allocation free…


Not with reinterpret, since that requires a heap allocated buffer - twiddling bits and calling the constructor manually should work though.
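
For one field, the twiddling could look like this (a sketch with a hypothetical helper name, assuming the packet bytes are little-endian):

function load_le_u32(buffer::Vector{UInt8}, offset::Integer)
    # assemble a UInt32 from four packed bytes, least significant byte first
    return UInt32(buffer[offset])           |
           UInt32(buffer[offset + 1]) << 8  |
           UInt32(buffer[offset + 2]) << 16 |
           UInt32(buffer[offset + 3]) << 24
end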


How about this: I made a generated function that reads each field from a reinterpreted view of the buffer exactly one element long. I assumed this might be optimized quite well by the compiler, because it can delete the reinterpret machinery once the value has been passed to the constructor.

I didn’t set up the example like you did but instead just read 100 of these objects from a buffer and add some fields together to show they can be accessed. The whole loop then takes 4.5 microseconds, so hopefully that is faster than the other solutions?

using BenchmarkTools

struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

@generated function struct_at_index(TType, buf::Vector{UInt8}, i)

    getparam(::Type{Type{T}}) where T = T  # in a generated function, TType is bound to Type{TLM}; this unwraps it to TLM

    T = getparam(TType)
    fs = fieldtypes(T)
    sizes = sizeof.(fs)
    sz = sum(sizes)
    offsets = cumsum([0; sizes[1:end-1]...])

    constructor_args = map(fs, sizes, offsets) do _ft, _sz, _offset
        :(reinterpret($_ft, view(buf, location + $_offset : location + $_offset + $_sz - 1))[1])
    end

    constructor = Expr(:call, T, constructor_args...)

    quote
        location = $sz * (i - 1) + 1
        $constructor
    end
end

buf = rand(UInt8, 42 * 100)

struct_at_index(TLM, buf, 1)

function testfunc(buf)
    sum(1:100) do i
        tlm = struct_at_index(TLM, buf, i)
        tlm.a + tlm.d
    end
end
julia> @btime testfunc($buf)
  4.472 μs (0 allocations: 0 bytes)
0x0000003624182e4c

Oh and if you do this

quote
    @inbounds begin
        location = $sz * (i - 1) + 1
        $constructor
    end
end

it seems even faster, and the output from @code_native struct_at_index(TLM, buf, 1) looked much cleaner.

julia> @btime testfunc($buf)

  197.484 ns (0 allocations: 0 bytes)
0x00000030662eef16

I’m not sure if I messed up and this is just fast because it’s wrong :smiley:

Maybe just read the fields from IOBuffer(buffer) rather than calling reinterpret on views?
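
For example (a quick sketch): the stream position advances past each field, so no index arithmetic is needed and padding never enters the picture:

buffer = zeros(UInt8, 42)
io = IOBuffer(buffer)
a = read(io, UInt16)  # consumes bytes 1:2
b = read(io, UInt32)  # consumes bytes 3:6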


Note that there have been various efforts to make packages to address this sort of thing, but they all seem moribund: StrPack.jl and StructIO.jl

This seems slower:

@generated function struct_at_index2(TType, buf::Vector{UInt8}, i)

    getparam(::Type{Type{T}}) where T = T

    T = getparam(TType)
    fs = fieldtypes(T)
    sizes = sizeof.(fs)
    sz = sum(sizes)

    constructor_args = map(fs) do _ft
        :(read(bufviewio, $_ft))
    end

    constructor = Expr(:call, T, constructor_args...)

    quote
        @inbounds begin
            location = $sz * (i - 1) + 1
            bufviewio = IOBuffer(view(buf, location:location+$sz-1))
            $constructor
        end
    end
end

testfunc comes out at 1.685 μs with that code. The code_native output is much longer again; the native code in my previous example is surprisingly short:

julia> @code_native struct_at_index(TLM, buf, 1)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ Untitled-1:17 within `struct_at_index`
        movq    %rdi, %rax
; │┌ @ Untitled-1:35 within `macro expansion`
; ││┌ @ int.jl:88 within `*`
        imulq   $42, %rdx, %rcx
        movq    (%rsi), %rdx
        movzwl  -42(%rcx,%rdx), %esi
        movq    -40(%rcx,%rdx), %rdi
        vmovups -32(%rcx,%rdx), %ymm0
; │└└
; │┌ @ Untitled-1:36 within `macro expansion`
        movw    %si, (%rax)
        movq    %rdi, 4(%rax)
        vmovups %ymm0, 12(%rax)
        vzeroupper
        retq
        nopw    (%rax,%rax)
; └└

Thanks everyone for the quick replies. @jules the generated function works fantastically. I’ve only done a little dabbling with macros (and never generated functions), so I’m still trying to understand exactly what is going on, but it does what I wanted.

Adding the locking back in (with the do syntax) broke it (it says the AST is no longer pure… but maybe I’m just doing it wrong?). Just calling lock / unlock works, and while it adds overhead (20x longer than without), it is still a 100x speed-up over my MWE (10 μs with allocations → 100 ns with no allocations). But, for my edification, does anyone know why doing

quote
    @inbounds begin
        lock(buffer_lock) do
            $constructor
        end
    end
end

is disallowed in a generated function, while

quote
    @inbounds begin
        lock(buffer_lock)
        newT = $constructor
        unlock(buffer_lock)
        newT
    end
end

is fine?
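
(As an aside, to be safe the manual form should probably be written with try/finally, so an exception in the constructor can't leave the lock held; a sketch:)

quote
    @inbounds begin
        lock(buffer_lock)
        try
            $constructor
        finally
            unlock(buffer_lock)
        end
    end
end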

Putting all this together, here is an alternate KISS version

using BenchmarkTools

struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

function TLM(buffer::IOBuffer)
    a = read(buffer, UInt16)
    b = read(buffer, UInt32)
    c = read(buffer, UInt32)
    d = read(buffer, UInt32)
    e = read(buffer, UInt32)
    f = read(buffer, UInt32)
    g = read(buffer, UInt32)
    h = read(buffer, UInt32)
    i = read(buffer, UInt32)
    j = read(buffer, UInt32)
    k = read(buffer, UInt32)
    TLM(a, b, c, d, e, f, g, h, i, j, k)
end

function unpack(::Type{T}, buffer::IOBuffer, buffer_lock) where {T}
    newT = lock(buffer_lock) do
        T(buffer)
    end
    return newT
end

@btime unpack(TLM, IOBuffer(buffer), lock) setup=(
    buffer = zeros(UInt8, 42);
    lock = ReentrantLock())

yielding

  75.310 ns (1 allocation: 64 bytes)

Also, for anyone like myself who is new to generated functions: instead of writing an extra function to get the type (the getparam trick above), the @generated macro was fine with using a parameterized function signature to get T:

@generated function struct_at_index2(::Type{T}, buf::Vector{UInt8}, i) where {T}

CBinding.jl works well for these use cases as well, especially when there is a C header/type definition of the data or if different alignment/packing is used. It might even perform better…

julia> using CBinding

       c``

       c"""
       #include <stdint.h>
       """s

       const c"uint16_t" = UInt16
       const c"uint32_t" = UInt32

       c"""
       struct Tlm {
           uint16_t a;
           uint32_t b, c, d, e, f, g, h, i, j, k;
       } __attribute__((packed));
       """j

julia> sizeof(Tlm)
42

julia> function testfunc(ptr::Cptr{Tlm})
           sum(1:100) do i
               ptr[i].a[] + ptr[i].d[]
           end
       end

julia> buf = rand(UInt8, 42*100);

julia> @btime testfunc($(Cptr{Tlm}(pointer(buf))))
  56.697 ns (0 allocations: 0 bytes)
0x00000030a830eaa5

julia> @code_native testfunc(Cptr{Tlm}(pointer(buf)))
        .text
; ┌ @ REPL[11]:1 within `testfunc'
        pushq   %rax
; │ @ REPL[11]:2 within `testfunc'
        movq    %rdi, (%rsp)
; │┌ @ reducedim.jl:874 within `sum'
; ││┌ @ reducedim.jl:874 within `#sum#680'
; │││┌ @ reducedim.jl:878 within `_sum'
; ││││┌ @ reducedim.jl:878 within `#_sum#682'
; │││││┌ @ reducedim.jl:310 within `mapreduce'
; ││││││┌ @ reducedim.jl:310 within `#mapreduce#672'
; │││││││┌ @ reducedim.jl:318 within `_mapreduce_dim'
        movabsq $.rodata.cst16, %rsi
        movabsq $_mapreduce, %rax
        movq    %rsp, %rdi
        callq   *%rax
; │└└└└└└└
        popq    %rcx
        retq
; └


I like keeping it simple. However, I am trying to keep things generic: I have many telemetry types, and for code hygiene I don’t want to write a function like the field-by-field TLM(buffer::IOBuffer) constructor above, which requires manually specifying all of the parameters and types again (I only want to do that once, at the struct definition). That said, maybe there is a way to write this generically that gets the original performance (which was on par with the generated function method). I’ll give it a try.

Thanks!

Let me introduce you all to the magical wonders of @nexprs:

julia> function example(buf)
           a = read(buf, UInt16)
           Base.Cartesian.@nexprs 9 i -> (a_i = read(buf, UInt32))
           (a, a_1,a_2,a_3,a_4,a_5,a_6,a_7,a_8,a_9)
       end
example (generic function with 1 method)

julia> example(IOBuffer([0x00 for _ in 1:100]))
(0x0000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000)

julia> using BenchmarkTools

julia> @benchmark example(io) setup=(io=IOBuffer([0x00 for _ in 1:100])) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.000 ns …  1.765 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.773 ns ± 29.007 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▇██▇▅▄▃▃▃▃▂▂▂▂▂▁▁▂▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁ ▁                    ▂
  ████████████████████████████████████████▇▇▇▆▇▇▇▆▆▇▆▇▆▆▆▅▆▅▅ █
  40 ns        Histogram: log(frequency) by time       102 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The only downside is that you have to specify the number of variables for @nexprs as a literal, so no automatic generation (well aside from in a generated function…).

As an aside, you won’t get around manually writing the reinterpretation in quite a few cases anyway, since all this reinterpret business is really only kosher for isbits types, i.e. no pointers. For those you will have to go through constructors.
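
A quick way to check that precondition:

julia> isbitstype(TLM)  # all fields are primitives, no pointers
true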


Ah yeah that looks better, I was kind of noodling my way through there.

The answer is in this post: Constraints for `@generated` function - #3 by Chris_Foster

The generated function is not allowed to introduce a new type, and that’s what happens with a closure.
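
A small illustration (the printed type name will vary):

julia> g = () -> 1 + 1;  # a do block lowers to an anonymous function like this

julia> typeof(g)         # every anonymous function gets a brand-new type
var"#1#2"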


My macro-fu is weak, but here we go

using BenchmarkTools

function readstruct(T, buffer)
    fns = fieldnames(getfield(@__MODULE__, T))
    fts = fieldtypes(getfield(@__MODULE__, T))
    stmts = []
    for (fn, ft) in zip(fns, fts)
        push!(stmts, :($(esc(fn)) = read($(esc(buffer)), $ft)))
    end
    args = [:($(esc(fn))) for fn in fns]
    push!(stmts, :($T($(args...))))
end

macro readstruct(T, buffer)
    stmts = readstruct(T, buffer)
    quote $(stmts...) end
end

struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

function TLM(buffer::IOBuffer)
    # @show @macroexpand(@readstruct TLM buffer)
    @readstruct TLM buffer
end

function unpack(::Type{T}, buffer::IOBuffer, buffer_lock) where {T}
    newT = lock(buffer_lock) do
        T(buffer)
    end
    return newT
end

@btime unpack(TLM, IOBuffer(buffer), lock) setup=(
    buffer = zeros(UInt8, 42);
    lock = ReentrantLock())

yielding

  74.097 ns (1 allocation: 64 bytes)

Critique welcome!

Edit: for everyone asking (like me) about the difference between IOBuffer’s read and reinterpret.

And since we opened the gates to the paradise of Base.Cartesian let’s go all the way in:

function unpack(::Type{T}, buffer::IOBuffer, buffer_lock) where T
    return lock(buffer_lock) do
        unsafe_unpack(T, buffer)
    end
end

@generated function unsafe_unpack(::Type{T}, buffer) where T
    ftypes = fieldtypes(T)
    l = length(ftypes)
    return :(begin
        Base.Cartesian.@nexprs $l i->a_i = read(buffer, $(ftypes)[i])
        Base.Cartesian.@ncall $l $T a
    end)
end

@nhardy should be general enough for all isbits types :wink:

struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

using BenchmarkTools

@benchmark unpack(TLM, IOBuffer(buffer), lock) setup=(
           buffer=zeros(UInt8, 42);
           lock=ReentrantLock())

gives me

101.729 ns (1 allocation: 64 bytes)

That’s consistently ~10% faster in both minimum and median times (at least on my laptop :P) than the slightly more complicated version by @goerch.

However, I particularly like the version by @krrutkow for its extremely short @code_native. What we get here, while fast, is not pretty :stuck_out_tongue:


Hi @jules, @abulak,

Being relatively new to the world of meta-programming in Julia: what is your advice on when to switch from a macro to a generated function? I’ve seen some older advice about unrolling with respect to compiler weaknesses, as well as advice to avoid generated functions. Is there a current rule of thumb?

Thanks in advance!

Thanks again to the Julia community, these are some fantastic suggestions; I had never come across generated functions before this. The initial suggestion from @jules got me over the hump, but I am going to implement some of the others for comparison. I especially like @abulak’s suggestion: the combination of IOBuffer reads, @nexprs, and @generated gives concise, clean code that I have a hope of understanding when I inevitably revisit it in 3 years.