Efficiently interpreting byte-packed buffer

I would say that @generated is exactly for cases like this one. We can’t easily achieve type stability with normal functions because we have to loop over differently typed fields of a struct, and as we saw in the first attempts, the intermediate Tuples and splatting lead to unnecessary overhead. This pattern, where looping over a small but heterogeneous collection is type unstable, is usually what leads to manual unrolling.

You can’t easily use simple macros here, because the code we want to write depends on the input types of the function. Macros are for cases where the code you want to write depends only on some expression. (You could eval things in a macro to approximate what @generated does, but that’s very much not advisable; macros should be pure syntax transformations.)
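To make the distinction concrete, here is a minimal sketch. The names `@seen_by_macro` and `seen_by_generated` are made up for illustration: the macro only ever receives the syntax that was written, while the generator body of an @generated function receives the type of the argument.

```julia
# Hypothetical names, for illustration only.
macro seen_by_macro(ex)
    # A macro receives syntax: `ex` is whatever expression the caller wrote.
    return QuoteNode(ex)
end

@generated function seen_by_generated(x)
    # Inside the generator body, `x` is bound to the *type* of the argument,
    # so type-dependent facts such as its size are available while building code.
    return :(($x, $(sizeof(x))))
end

m = @seen_by_macro UInt16        # the Symbol :UInt16, not the type
g = seen_by_generated(0x0001)    # (UInt16, 2), derived from the argument type
```

The macro cannot ask for `sizeof` or `fieldtypes` of something it only knows as a symbol, which is why the struct-unpacking code needs either @generated or a macro that sees the whole struct definition.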

So here the idea is just to write out the constructor as we would manually do, so the compiler has all information directly available.
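As a rough sketch of that idea (not necessarily the exact code posted earlier in the thread), the generator can emit one `read` call per field and splice them into a constructor call. The struct `Packet` and the function name `unpack_generated` are assumptions made up for this example.

```julia
# Hypothetical example struct; not the TLM struct from the thread.
struct Packet
    a::UInt16
    b::UInt32
    c::UInt8
end

@generated function unpack_generated(::Type{T}, io::IO) where {T}
    # `T` is the concrete struct type here, so we can emit one `read` per field.
    reads = [:(read(io, $(fieldtype(T, i)))) for i in 1:fieldcount(T)]
    # For Packet this is equivalent to writing by hand:
    #   T(read(io, UInt16), read(io, UInt32), read(io, UInt8))
    return :(T($(reads...)))
end

io = IOBuffer()
write(io, UInt16(1), UInt32(2), UInt8(3))   # 7 tightly packed bytes, host byte order
seekstart(io)
p = unpack_generated(Packet, io)
```

Because the field types appear as literal types in the generated call, the compiler sees the same code it would see for a hand-written constructor.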

I didn’t know the macros from Base.Cartesian yet; they save you from having to write out the expressions by hand like I did.


You can avoid the whole @generated issue like this:

function unsafe_unpack(::Type{T}, buffer) where T
    fieldvals = ntuple(i -> read(buffer, fieldtype(T, i)), Val(fieldcount(T)))
    return T(fieldvals...)
end

This gives me the same performance as the @generated code above, and is quite a bit more transparent.


And wouldn’t you know it, the specific version with Val that you’re using here is also doing something generated: https://github.com/JuliaLang/julia/blob/aa497b579fdfcd96608e5753f2806041424a3cba/base/ntuple.jl#L69

Pretty much the same thing as in one of the solutions above. I would argue it’s not necessarily super transparent that you can do this with Val, although the code is of course shorter if the complexity of the inner function is not visible at the surface.

In general I would say that it can be hard to predict with such functions which one’s going to be inferred and compiled best, and it’s honestly a little bit of trial and error.
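One way to make the trial and error a bit more systematic is to ask the compiler what it inferred and to measure allocations after a warm-up call. The following is only a sketch: `Sample`, `unpack_ntuple`, and `measure` are names invented here, and what gets printed can differ between Julia versions.

```julia
struct Sample   # hypothetical struct for illustration
    a::UInt16
    b::UInt32
end

function unpack_ntuple(::Type{T}, io) where {T}
    vals = ntuple(i -> read(io, fieldtype(T, i)), Val(fieldcount(T)))
    return T(vals...)
end

# Ask the compiler what it infers for this call signature. A concrete type
# such as `Sample` is the goal; `Any` or a Union hints at a type instability.
inferred = Base.return_types(unpack_ntuple, (Type{Sample}, IOBuffer))
println(inferred)

# Measure allocations inside a function to avoid global-scope noise.
measure(io) = @allocated unpack_ntuple(Sample, io)

io = IOBuffer(zeros(UInt8, 6))
s = unpack_ntuple(Sample, io)   # first call compiles and reads the 6 bytes
seekstart(io)
measure(io)                     # warm-up call for `measure` itself
seekstart(io)
println(measure(io))            # bytes allocated by one unpack call
```

For a closer look, `@code_warntype unpack_ntuple(Sample, io)` in the REPL highlights where inference gives up.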


That’s what macros are for. I’m surprised that nobody has proposed a macro that acts on the struct definition yet, so let’s add one to the mix. This can absolutely be improved with respect to robustness and safety but the performance should be fine.

macro magic(ex)
    inner_constructor = quote
        function $(ex.args[2])(buffer::Vector{UInt8})
            new()
        end
    end

    new_args = inner_constructor.args[2].args[2].args[3].args
    i = 1
    for field in ex.args[3].args
        if field isa Expr
            data_type = field.args[2]
            push!(new_args,
                  :(unsafe_load(Ptr{$(data_type)}(pointer(buffer, $i)), 1)))
            i += sizeof(getfield(__module__, data_type))  # resolve the type in the calling module, not only in Main
        end
    end
    push!(ex.args[3].args, inner_constructor)
    esc(ex)
end

@magic struct TLM
    a::UInt16
    b::UInt32
    c::UInt32
    d::UInt32
    e::UInt32
    f::UInt32
    g::UInt32
    h::UInt32
    i::UInt32
    j::UInt32
    k::UInt32
end

buffer = rand(UInt8, 42)
tlm = TLM(buffer)
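The macro above advances its byte position by `sizeof` of each field, which assumes the fields sit tightly packed in the buffer. That is not the same as Julia's in-memory layout of the struct, which may insert padding. A small sketch of the difference, using a made-up struct `Header` and a made-up helper `packed_offsets`:

```julia
struct Header   # hypothetical struct for illustration
    a::UInt16
    b::UInt32
    c::UInt32
end

# Zero-based byte offsets of each field in a tightly packed buffer,
# plus the total number of bytes consumed.
function packed_offsets(::Type{T}) where {T}
    offsets = Int[]
    off = 0
    for ft in fieldtypes(T)
        push!(offsets, off)
        off += sizeof(ft)
    end
    return offsets, off
end

offsets, nbytes = packed_offsets(Header)   # ([0, 2, 6], 10)

# In-memory layout of the Julia struct, which aligns the UInt32 fields.
padded = [Int(fieldoffset(Header, i)) for i in 1:fieldcount(Header)]   # [0, 4, 8]
```

The macro's index `i` is the one-based version of these packed offsets, since `pointer(buffer, i)` takes a one-based element index.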

I meant that it is easier to understand what the code is doing. Predicting performance is of course hard, as it often is.

Yes the clarity of the ntuple solution is much better than that of my generated version, however I was surprised that it should perform as well, and only understood why after going to the source. That’s what I meant by “not necessarily super transparent”.


What I find additionally interesting is that the ntuple function doesn’t use the @generated macro like we normally do; it uses the no-argument version, which just does this:

macro generated()
    return Expr(:generated)
end

And the source code of @generated(f) itself is this:

macro generated(f)
    if isa(f, Expr) && (f.head === :function || is_short_function_def(f))
        body = f.args[2]
        lno = body.args[1]
        return Expr(:escape,
                    Expr(f.head, f.args[1],
                         Expr(:block,
                              lno,
                              Expr(:if, Expr(:generated),
                                   body,
                                   Expr(:block,
                                        Expr(:meta, :generated_only),
                                        Expr(:return, nothing))))))
    else
        error("invalid syntax; @generated must be used with a function definition")
    end
end

So ntuple uses part of this machinery, the Expr(:if, Expr(:generated), ...) construct, which I don’t quite understand. Maybe someone knows what Julia actually does with this?

Nice solution! You don’t even need ntuple then and can write the original unpack like

function unpack(::Type{T}, buffer::IOBuffer, buffer_lock) where {T}
    newT = lock(buffer_lock) do
        T((read(buffer, fieldtype(T, i)) for i in 1:fieldcount(T))...)
    end
    return newT
end

using BenchmarkTools

@btime unpack(TLM, IOBuffer(buffer), lk) setup=(
    buffer = zeros(UInt8, 42);
    lk = ReentrantLock())  # named `lk` to avoid shadowing Base.lock

yielding

  74.691 ns (1 allocation: 64 bytes)

with one allocation for the IOBuffer remaining…


That’s the syntax for optionally-generated functions, which allows one to provide an alternative non-@generated implementation and let the compiler decide on which one to use.
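Applied to the unpacking problem, an optionally-generated function could look roughly like the sketch below. The struct `Pair16_32` and the name `unpack_optgen` are invented for this example; both branches are meant to compute the same result.

```julia
struct Pair16_32   # hypothetical struct for illustration
    a::UInt16
    b::UInt32
end

function unpack_optgen(::Type{T}, io::IO) where {T}
    if @generated
        # Generator branch: runs with `T` known and returns an expression
        # equivalent to T(read(io, F1), read(io, F2), ...).
        call = Expr(:call, :T)
        for i in 1:fieldcount(T)
            push!(call.args, :(read(io, $(fieldtype(T, i)))))
        end
        return call
    else
        # Ordinary branch: plain code the compiler is free to use instead.
        return T(ntuple(i -> read(io, fieldtype(T, i)), fieldcount(T))...)
    end
end

io = IOBuffer()
write(io, UInt16(5), UInt32(7))
seekstart(io)
r = unpack_optgen(Pair16_32, io)
```

This mirrors the structure of `ntuple(f, ::Val{N})` in Base, which has a generated branch and a plain fallback.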


Huh. So the ntuple solution from @DNF makes sense (it uses @generated under the hood), but I’m surprised the straight tuple solution from @goerch works. Interestingly, I found that both approaches are fast when iterating over the field indices and calling fieldtype(T, i) inside the iteration, but if you do what I originally did and first get the fieldtypes and then index into them, it is inefficient (extra allocations and slower). That is,

T((read(buffer, fieldtype(T, i)) for i in 1:fieldcount(T))...)

is performant, but

fts = fieldtypes(T)
T((read(buffer, fts[i]) for i in 1:fieldcount(T))...)

and

fts = fieldtypes(T)
T((read(buffer, ft) for ft in fts)...)

are not. It is worth noting that this is not the case in the explicitly generated versions, likely because the fieldtypes call happens in the @generated block and each fieldtype is made explicit in the code generation / unrolling from @nexprs.

I am not sure if I prefer these simple tuple/ntuple versions or the @generated versions. Simple is great, but it seems like they are relying on hidden compiler optimizations that - for now - mirror the explicitly generated version, but are a little fragile. Thoughts? I feel like the generated version is more likely to produce the same performance down the road (when I inevitably need to change something) even if the syntax is a little harder to parse.

Thanks again everyone for all of the great input.

This isn’t actually using a tuple, but a generator. I’m not sure I understand why some of those are fast and others not.
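To illustrate the distinction with a toy example (unrelated to the buffer code): a parenthesized `for` expression creates a lazy `Base.Generator`, whereas `ntuple` with a `Val` returns a tuple whose length is part of its type.

```julia
gen = (UInt8(i) for i in 1:3)          # lazy Base.Generator over a range
tup = ntuple(i -> UInt8(i), Val(3))    # NTuple{3,UInt8}; length is in the type

@assert gen isa Base.Generator
@assert tup isa NTuple{3,UInt8}
@assert (gen...,) == tup               # splatting the generator yields the same values
```

When a generator is splatted into a constructor, the compiler has to work out the number and types of the elements from the iteration itself, which is one plausible reason the variants above behave differently.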

As for the tuple solution, I would normally be pretty confident that a statically sized tuple provides very good performance, so this doesn’t really seem that obscure or fragile to me. But occasionally, things that should be fast are not, for obscure reasons.

As for hiding complexity, I don’t know how to avoid that in programming; isn’t that sort of the whole point? Compilers, also, are hugely complex.

I guess the winner of my own private contest is the version given by @DNF:

function unsafe_unpack(::Type{T}, buffer) where T
    fieldvals = ntuple(i -> read(buffer, fieldtype(T, i)), Val(fieldcount(T)))
    return T(fieldvals...)
end

Avoiding generators is always good, and this one is again ~10% faster than the @generated one, while the generator version is 5% slower (again: on my laptop ;).