Big endian conversion on custom datatypes

tamasgal · November 14, 2019, 7:54am

I have to deal with a couple of “big endian” structures in raw files and want to parse them into Julia structs.

Here is an MWE, where I have some data (big endian, coming from the network or from a file), a struct and a little function to parse a Vector{UInt8} into a given type:

data = Vector{UInt8}([0x00, 0x00, 0x00, 0x64, 0x00, 0x00, 0x00, 0x32])

struct Foo
    a::Int32
    b::Int32
end

function retrieve(::Type{T}, data) where {T}
    ref = Ref{T}()
    read!(IOBuffer(data), ref)
    return ref[]
end

The only problem is of course the endianness:

julia> f = retrieve(Foo, data)
Foo(1677721600, 838860800)

julia> f.a
1677721600

julia> f.b
838860800

julia> ntoh(f.a)  # the correct value of a
100

julia> ntoh(f.b)  # the correct value of b
50

Now I am not sure how to deal with the big endian conversion ntoh() effectively, since I am reading a lot of data and there are many different structures to parse.

I thought about creating a macro like StrPack does (@struct) and using that instead of struct to create my types, basically reading the fields and converting them one by one, but this Ref and read!() workflow seems to be much more efficient than reading the data piece by piece. I also thought about using StrPack itself, but it is currently not working on Julia 1.x and it also seems to be overkill since I only need to deal with big endian data, no padding or other annoying stuff. That would however be a community contribution, which is a bonus of course.

Anyways, to solve the main problem first: is there any clever way to somehow hook into the Ref+read! stuff? Of course doing ntoh() on the data itself is nonsense because those are already octets and also read!() does not know anything about the structure of T, it just fills the reference. So it feels like it’s the wrong place to “hack”.

On the other hand, one solution which might be OK is something like read_big_endian!(::Type{T}, io, ref), where I read the exact amount of data from a buffer given the size of the struct, then iterate over the fields that need to be converted from big endian and swap their bytes in the buffer (in place) using reverse!() before actually calling read!().

Here is a hardcoded version just for demonstration purposes:

# Hardcoded for 4-byte fields; note that reverse! mutates data in place.
function retrieve_big_endian_32(::Type{T}, data) where {T}
    ref = Ref{T}()
    for idx in range(1; length=length(data) ÷ 4, step=4)
        reverse!(data, idx, idx+3)  # swap one 4-byte word
    end
    read!(IOBuffer(data), ref)
    return ref[]
end

julia> data = Vector{UInt8}([0x00, 0x00, 0x00, 0x64, 0x00, 0x00, 0x00, 0x32])
8-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x64
 0x00
 0x00
 0x00
 0x32

julia> retrieve_big_endian_32(Foo, data)
Foo(100, 50)
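
The hardcoded version above could be generalised along the following lines. This is only a sketch: retrieve_big_endian is a hypothetical name, and it assumes a little-endian host, a struct made only of primitive fields, and an in-memory layout that matches the layout of the raw data (no padding). It works on a copy so the caller's data is left untouched.

```julia
struct Foo
    a::Int32
    b::Int32
end

# Hypothetical helper (not part of Base or StrPack): reverse the bytes of
# every field of T inside a copy of `data`, then fill a Ref{T} from the copy.
function retrieve_big_endian(::Type{T}, data::AbstractVector{UInt8}) where {T}
    isbitstype(T) || throw(ArgumentError("$T is not a bits type"))
    length(data) >= sizeof(T) || throw(ArgumentError("not enough bytes for $T"))
    buf = data[1:sizeof(T)]                      # copy, caller's data is untouched
    for i in 1:fieldcount(T)
        lo = Int(fieldoffset(T, i)) + 1          # 1-based first byte of field i
        hi = lo + sizeof(fieldtype(T, i)) - 1    # last byte of field i
        reverse!(buf, lo, hi)
    end
    ref = Ref{T}()
    read!(IOBuffer(buf), ref)
    return ref[]
end

data = UInt8[0x00, 0x00, 0x00, 0x64, 0x00, 0x00, 0x00, 0x32]
foo = retrieve_big_endian(Foo, data)   # Foo(100, 50) on a little-endian machine
```

Using fieldoffset and fieldtype means fields of different widths (for example Int16 next to Int64) are each swapped over their own byte range instead of a fixed 4-byte stride.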

The big question to the experts is: how does an operation like reverse() on the raw data compare to ntoh from a performance point of view? I tried some benchmarks and it seems that ntoh (which calls bswap) is more or less a no-op, but I need to invest more time in the implementation to compare both approaches. For integers, bswap calls bswap_int, which is in base/compiler/tfuncs.jl and refers to a C function.
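
To make the comparison concrete, here is a small sketch showing that both routes produce the same value for a single Int32, assuming a little-endian host. The difference is where the swap happens: reverse() shuffles bytes in memory before the value is interpreted, while ntoh swaps the already loaded value, which on common hardware typically comes down to a single byte-swap instruction.

```julia
bytes = UInt8[0x00, 0x00, 0x00, 0x64]   # 100 encoded as a big-endian Int32

# Route 1: interpret the bytes in host order, then swap the loaded value.
raw = reinterpret(Int32, bytes)[1]
via_ntoh = ntoh(raw)

# Route 2: reverse the bytes in memory first, then interpret them.
via_reverse = reinterpret(Int32, reverse(bytes))[1]

# Assumption: little-endian host. There, ntoh is bswap and both routes agree.
println(via_ntoh == via_reverse)
```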

I am sorry that I have not invested more time, but I hope that some low-level experts might point me in the right direction before I dive into complicated macros or the like.

jling · November 14, 2019, 8:41am

Ah, a ROOT file. Julia has bswap too:
https://docs.julialang.org/en/v1/base/numbers/#Base.bswap

I think this is enough

tamasgal · November 14, 2019, 8:51am

Yeah, as I wrote, I already know about bswap but the question is the overall design (see last paragraphs).

tamasgal · November 14, 2019, 10:01am

Here are some toy examples (all of them tailored to Foo, which here has a third field of type Float32 in addition to the two Int32 fields):

function retrieve_foo(data)
    buf = IOBuffer(data)
    Foo(read(buf, Int32), read(buf, Int32), read(buf, Float32))
end

function retrieve_foo_via_ref(data)
    ref = Ref{Foo}()
    read!(IOBuffer(data), ref)
    return ref[]
end


function retrieve_big_endian_32_foo(data)
    ref = Ref{Foo}()
    @inbounds for idx in range(1; length=length(data) ÷ 4, step=4)
        reverse!(data, idx, idx+3)
    end
    read!(IOBuffer(data), ref)
    return ref[]
end


function retrieve_big_endian_32_foo_hardcoded_using_reverse_and_ref(data)
    ref = Ref{Foo}()
    reverse!(data, 1, 4)
    reverse!(data, 5, 8)
    reverse!(data, 9, 12)
    read!(IOBuffer(data), ref)
    return ref[]
end


function retrieve_big_endian_32_foo_hardcoded_using_ntoh(data)
    buf = IOBuffer(data)
    Foo(ntoh(read(buf, Int32)), ntoh(read(buf, Int32)), ntoh(read(buf, Float32)))
end

Here are some performance checks:

data = Vector{UInt8}([0x00, 0x00, 0x00, 0x64, 0x00, 0x00, 0x00, 0x32, 0x00, 0x00, 0x00, 0x16]);

@btime retrieve_foo($data)
  11.946 ns (1 allocation: 64 bytes)
Foo(1677721600, 838860800, 1.0339758f-25)

@btime retrieve_foo_via_ref($data)
  14.791 ns (2 allocations: 96 bytes)
Foo(1677721600, 838860800, 1.0339758f-25)

@btime retrieve_big_endian_32_foo($data)
  37.313 ns (2 allocations: 96 bytes)
Foo(100, 50, 3.1f-44)

@btime retrieve_big_endian_32_foo_hardcoded_using_reverse_and_ref($data)
  26.535 ns (2 allocations: 96 bytes)
Foo(100, 50, 3.1f-44)

@btime retrieve_big_endian_32_foo_hardcoded_using_ntoh($data)
  11.652 ns (1 allocation: 64 bytes)
Foo(100, 50, 3.1f-44)
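
One caveat when reading these numbers: the reverse!-based variants mutate data in place, and a benchmark calls the function many times on the same vector, so successive calls alternate between swapped and unswapped bytes. The sketch below isolates that effect; swap_words! is a hypothetical helper that performs the same 4-byte swap as the loops above.

```julia
# Hypothetical helper: the same in-place swap of 4-byte words as in the
# reverse!-based variants above.
function swap_words!(data::Vector{UInt8})
    for idx in 1:4:length(data)-3
        reverse!(data, idx, idx + 3)
    end
    return data
end

data = UInt8[0x00, 0x00, 0x00, 0x64, 0x00, 0x00, 0x00, 0x32]
original = copy(data)

swap_words!(data)
after_one_call = copy(data)    # bytes are now in little-endian order

swap_words!(data)
after_two_calls = copy(data)   # a second call undoes the first

# Working on a copy keeps repeated calls consistent, at the cost of an allocation.
swapped = swap_words!(copy(original))
```

If the in-place variants are benchmarked with BenchmarkTools, giving each evaluation a fresh copy through its setup and evals=1 options is one way to avoid measuring this alternating state.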

It seems that retrieve_foo, which is the reference, has the same performance as retrieve_big_endian_32_foo_hardcoded_using_ntoh. I still do not understand the ntoh magic, but I can live with that.

Maybe someone can shed light on that…
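
Since the ntoh variant is on par with the reference, one possible generic design is to keep the single Ref + read!() call and then apply ntoh to each field of the result. The following is a minimal sketch, not a definitive implementation: ntoh_fields and read_big_endian are hypothetical names, Foo3 is a stand-in for the three-field Foo used in the benchmarks, and every field is assumed to be a primitive number for which ntoh is defined.

```julia
# Stand-in for the three-field Foo used in the benchmarks above.
struct Foo3
    a::Int32
    b::Int32
    c::Float32
end

# Hypothetical helper: apply ntoh to every field and rebuild the struct
# with its default constructor.
function ntoh_fields(x::T) where {T}
    return T(ntuple(i -> ntoh(getfield(x, i)), fieldcount(T))...)
end

# Hypothetical helper: one bulk read in host byte order, then swap per field.
function read_big_endian(::Type{T}, io::IO) where {T}
    ref = Ref{T}()
    read!(io, ref)
    return ntoh_fields(ref[])
end

data = UInt8[0x00, 0x00, 0x00, 0x64, 0x00, 0x00, 0x00, 0x32, 0x00, 0x00, 0x00, 0x16]
foo = read_big_endian(Foo3, IOBuffer(data))
```

This leaves data unmodified and needs no per-type code. Because ntoh is the identity on big-endian hosts, the same function should also behave correctly there.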
