Packing and unpacking binary data

I am trying to port my stuff from Python to Julia and struggling with a piece of code which implements a basic network communication protocol.

Update: The original problem is solved and turned out to be a typo. Feel free to skip to my next question :wink:


In Python, I heavily use the struct.pack and struct.unpack stuff, which is basically the way to encode and decode binary data. What is the actual way of parsing bytes in Julia?

Here is an example:

In [19]: struct.pack('>ii', 23, 42)
Out[19]: b'\x00\x00\x00\x17\x00\x00\x00*'
In [20]: struct.unpack('>ii', b'\x00\x00\x00\x17\x00\x00\x00*')
Out[20]: (23, 42)

Now in Julia, if I execute read(s, 16) to receive the network packet header of size 16, I get an array of UInt8:

16-element Array{UInt8,1}:
 0x66
 0x6f
 0x6f
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x04
 0x00
 0x00
 0x00
 0x00

So what would be the standard way of parsing the data according to a given structure? Should I write an immutable type which constructs itself from a given array, or is there something already in the standard library which is made for this kind of operation?

Sorry, I am not familiar with Python, so I am not sure I understand the problem, but check serialize and deserialize in Base. Neither requires that you specify the type, since it is encoded in the stream; however, the type must be defined in the process that deserializes.

Hi,

thanks for the quick reply. I already looked at [de]serialize but it seems that I cannot specify the structure manually. I need that since I am talking to processes written in different languages (all using the same custom protocol).

So what I basically mean is, if there is an example binary data structure like:

foo [4byte integer], bar [8byte float64], baz [4 byte integer]

which is for example this as a byte string (foo=23, bar=3.14, baz=42):

'\x00\x00\x00\x17@\t\x1e\xb8Q\xeb\x85\x1f\x00\x00\x00*'

This can be easily unpacked via the struct module in the Python standard library, where a simple tuple is returned:

In [22]: struct.unpack('>idi', b'\x00\x00\x00\x17@\t\x1e\xb8Q\xeb\x85\x1f\x00\x00\x00*')
Out[22]: (23, 3.14, 42)

In Julia I’d define a type/immutable like

immutable Whatever
    foo::Int32
    bar::Float64
    baz::Int32
end

and then my question is, how to create an instance if I use the data actually returned by the read() function:

julia> raw_data
16-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x17
 .
 .
 .
 0x00
 0x00

since I need to know the size of each attribute of Whatever.

Should I write a specific constructor for Whatever, or a helper function which iterates through the Whatever-attributes, determining the size etc?
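A helper that iterates over the fields could look roughly like the sketch below. It assumes Julia 1.x syntax (`struct` instead of `immutable`), a big-endian stream, and a type whose fields are all primitive numeric types; `unpack_be` is a hypothetical name, not a function from Base.

```julia
# Record type mirroring the layout described above
# (Julia 1.x spelling of the `immutable` definition).
struct Whatever
    foo::Int32
    bar::Float64
    baz::Int32
end

# Hypothetical helper: read every field of T in declaration order from a
# big-endian stream and convert each value to host byte order with ntoh.
# Only valid when all fields are primitive numeric types.
function unpack_be(io::IO, ::Type{T}) where {T}
    values = Any[]
    for S in fieldtypes(T)
        push!(values, ntoh(read(io, S)))
    end
    return T(values...)
end

# foo=23, bar=3.14, baz=42 packed back to back (16 bytes, no padding)
raw_data = UInt8[0x00, 0x00, 0x00, 0x17,
                 0x40, 0x09, 0x1e, 0xb8, 0x51, 0xeb, 0x85, 0x1f,
                 0x00, 0x00, 0x00, 0x2a]

w = unpack_be(IOBuffer(raw_data), Whatever)
println(w)
```

Because each field is read with its own type, the size of every attribute comes from the field type itself and no manual offsets are needed.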


I’d actually strongly recommend against using the Base.serialize / Base.deserialize functions if you are doing anything that needs to persist data, as the format is not documented and is not guaranteed to remain compatible between Julia versions.


OK, any other suggestions then? I am currently quite confused about how to go from an array of UInt8 (this is what I get when I read from the socket stream) to a Whatever-object (which I actually use for the calculations) and then to an actual string representation (which I need to send back via the network socket).
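For the way back (values to bytes), one possible sketch is to write each value into an IOBuffer in network byte order. `pack_be` is a hypothetical helper name, and the code assumes Julia 1.x and a big-endian wire format.

```julia
# Hypothetical helper: write each value in big-endian (network) byte
# order into an in-memory buffer and return the resulting bytes.
function pack_be(values...)
    io = IOBuffer()
    for v in values
        write(io, hton(v))   # hton converts host order to network order
    end
    return take!(io)         # Vector{UInt8} with everything written so far
end

# Same values as the Python struct.pack('>ii', 23, 42) example
bytes = pack_be(Int32(23), Int32(42))
println(bytes)
```

The resulting `Vector{UInt8}` can be passed to `write` on a socket directly, so no intermediate string is required.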

Converting the UInt8 array to a string is done by String(). I tried string() before but that was not the right one.

So now I am playing around with reinterpret but I need a way to deal with different endianness. So I guess I have to reverse the array if needed :confused:

julia> a[9:12]
4-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x04

julia> reinterpret(Int32, a[9:12])
1-element Array{Int32,1}:
 67108864

julia> reinterpret(Int32, reverse(a[9:12]))
1-element Array{Int32,1}:
 4
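Instead of reversing the array, the reinterpreted value can be passed through ntoh, which swaps bytes only when the host order differs from network (big-endian) order. A minimal sketch, assuming Julia 1.x and the same 16-byte header as above:

```julia
# The 16-byte header from the example above
a = UInt8[0x66, 0x6f, 0x6f, 0x00, 0x00, 0x00, 0x00, 0x00,
          0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00]

# Reinterpret bytes 9..12 as one Int32 in host order, then convert
# from network (big-endian) order to host order.
len = ntoh(reinterpret(Int32, a[9:12])[1])
println(len)
```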

This is what I came up with. It is quite an ugly implementation but it works for now. It would be great if someone could point me in the right direction on how to do this more elegantly, like using sizeof() to automatically derive the data positions etc.

julia> raw_data = read(s, 16)
16-element Array{UInt8,1}:
 0x66
 0x6f
 0x6f
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x04
 0x00
 0x00
 0x00
 0x00

The corresponding data representation in Julia:

immutable CHPrefix
    tag::String
    length::Int32
    
    function CHPrefix(data::Array{UInt8,1})
        tag = String(data[1:8])
        length = reinterpret(Int32, reverse(data[9:12]))[1]
        new(tag, length)
    end
end

And this is how it works now (still need to find out how to strip the \0 from the String but that should be trivial):

julia> CHPrefix(raw_data)
CHPrefix("foo\0\0\0\0\0",4)

There is a bswap function in Julia, that you can use on all the elements of the vector once you read it in.
If you are using v0.6, then it becomes beautifully fast and simple to do in-place: :smile:

julia> a = UInt16[1,2,3,4,5]
5-element Array{UInt16,1}:
 0x0001
 0x0002
 0x0003
 0x0004
 0x0005

julia> a .= bswap.(a)
5-element Array{UInt16,1}:
 0x0100
 0x0200
 0x0300
 0x0400
 0x0500

Note: you can also put the input bytes into an IOBuffer, or read the parts directly from the file, and read each value with read(io, type). For the Int32 it would be read(io, Int32), and if you know it is in reversed byte order, then bswap(read(io, Int32)).
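As a small sketch of that approach (assuming Julia 1.x and big-endian input), the bytes from the Python example can be wrapped in an IOBuffer and read value by value. It uses ntoh rather than bswap so that the swap only happens on hosts that need it.

```julia
# Bytes equivalent to Python's struct.pack('>ii', 23, 42)
data = UInt8[0x00, 0x00, 0x00, 0x17, 0x00, 0x00, 0x00, 0x2a]

io = IOBuffer(data)
first_value  = ntoh(read(io, Int32))   # consumes bytes 1..4
second_value = ntoh(read(io, Int32))   # consumes bytes 5..8

println((first_value, second_value))
```

Each `read` advances the buffer position, so the fields come out in the order they were written.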

Ah, that’s already very useful, thanks!

See also:
https://github.com/pao/StrPack.jl

and

https://github.com/tanmaykm/ProtoBuf.jl


Thanks. Protobuf is not an option for me but I will have a look at StrPack, although I hope I can stick with the standard library.

Yes, it’s pretty trivial! (Julia’s great that way :wink: )
rstrip(str, '\0') will remove any trailing nul bytes.
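Applied to the CHPrefix header from earlier, this might look like the sketch below. It assumes Julia 1.x syntax and uses an outer constructor instead of the original inner one; that is a design choice made for this example, not a requirement.

```julia
struct CHPrefix
    tag::String
    length::Int32
end

# Build a CHPrefix from the raw header: 8 tag bytes padded with NULs,
# followed by a big-endian Int32 length.
function CHPrefix(data::Vector{UInt8})
    io = IOBuffer(data)
    padded_tag = String(read(io, 8))           # e.g. "foo\0\0\0\0\0"
    tag = String(rstrip(padded_tag, '\0'))     # rstrip returns a SubString
    len = ntoh(read(io, Int32))
    return CHPrefix(tag, len)
end

raw_data = UInt8[0x66, 0x6f, 0x6f, 0x00, 0x00, 0x00, 0x00, 0x00,
                 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00]

p = CHPrefix(raw_data)
println(p)
```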

Here’s a minimal example (given your definitions above)

julia> a = b"\x00\x00\x00\x17@\t\x1e\xb8Q\xeb\x85\x1f\x00\x00\x00*"
julia> let buf=IOBuffer(a); Whatever(hton(read(buf,Int32)), hton(read(buf,Float64)), hton(read(buf,Int32))) end
Whatever(23,3.14,42)

As Scott pointed out, you can use read(buf,T,...) to read “arbitrary bytes” into “Julia objects”. But one issue to be aware of (aside from byte order) is that Julia uses C layout rules. So your struct definition:

immutable Whatever
    foo::Int32
    bar::Float64
    baz::Int32
end

is actually laid out in memory [8 bytes,8 bytes, 8 bytes], and the total size is 24. (per the rules, if you did [::Int32, ::Int32, ::Float64] instead, then you would get a 16-byte struct). StrPack provides tooling for working with more flexible layouts.
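The layout can be inspected with sizeof and fieldoffset. The sketch below assumes Julia 1.x syntax; `Reordered` is a hypothetical type added only for comparison, and the 24-byte figure is what the rules above give on a typical 64-bit platform.

```julia
struct Whatever
    foo::Int32
    bar::Float64
    baz::Int32
end

# Hypothetical variant with the two Int32 fields placed next to each other
struct Reordered
    foo::Int32
    baz::Int32
    bar::Float64
end

# Byte offset of every field inside a struct type
offsets(T) = [Int(fieldoffset(T, i)) for i in 1:fieldcount(T)]

println("Whatever:  size = ", sizeof(Whatever),  ", offsets = ", offsets(Whatever))
println("Reordered: size = ", sizeof(Reordered), ", offsets = ", offsets(Reordered))
```

Since the in-memory size can exceed the 16 bytes on the wire, reading field by field is safer than reinterpreting the raw bytes as the struct.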


That should actually be ntoh instead of hton above, since he is going from network format (i.e. big-endian) to host format (which is, I think, currently always little-endian for Julia, even on POWER platforms).
Good catch about the C layout rules. That would also be different on a 32-bit platform: you’d have 16 bytes instead of 24 for the Whatever structure (since there’d be no padding).
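A quick way to see what ntoh and hton do on the current host is to compare them with bswap. This is only a sketch, assuming Julia 1.x; `Base.ENDIAN_BOM` is 0x04030201 on little-endian hosts and 0x01020304 on big-endian ones.

```julia
# Detect the byte order of the host this code runs on
is_little_endian = Base.ENDIAN_BOM == 0x04030201

x = Int32(23)

# On a little-endian host both functions swap bytes; on a big-endian
# host both are the identity. Either way they undo each other.
expected = is_little_endian ? bswap(x) : x
println("little-endian host: ", is_little_endian)
println("ntoh(x) matches expectation: ", ntoh(x) == expected)
println("round trip ok: ", ntoh(hton(x)) == x)
```

On a little-endian host hton and ntoh therefore produce the same result, which is why the earlier example still printed the right values.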

Please mark one of the relevant answers as the solution using the “check” (or “tick”) button/icon that should appear after clicking the “…”.

Great, that makes sense now, thanks. It seems I will have to study StrPack, since the layouts of the binary formats I am dealing with often also contain padding and mixed endianness. I don’t want to build my package on sandy ground :wink:

Kind of hard to mark a solution as there are many useful inputs…
