Tips for fast parsing of "ad-hoc" protocol

I’m in the process of trying to implement a fast-ish protocol parser/reader for what (from what I can tell) is an ad-hoc-ish protocol. The use case, for context, is to turn a ‘let’s start this before lunch and maybe it’s done after’-task into something a bit more interactive, say 5-10 minutes, and despite the complications listed below I think there is plenty of opportunity to achieve this.

The protocol itself is defined in a parseable text format and the data itself seems to be in some raw-bytes format. The headers were pretty straightforward: I could just generate Julia structs from the protocol spec and reinterpret the data stream, and it seems to be quite fast (about 10 ns per packet), similar to what is done here. In case it matters, the primary use case is reading data that has already been dumped to a file, which I guess is a little different from traditional networking cases.
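
To give an idea, the header handling is roughly along these lines (field names and sizes are made up for the example):

struct Header
    id::UInt16       # selects which row of the "inner spec" applies to the payload
    version::UInt16  # spec version of the payload
    length::UInt16   # payload length in bytes
end

function read_header(data::Vector{UInt8})
    GC.@preserve data begin
        unsafe_load(convert(Ptr{Header}, pointer(data)))
    end
end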

However, the next step poses a bit of a challenge: from what I understand, the headers contain a version number for the spec of the payload protocol, this version changes very frequently (daily), and every revision is generally breaking. The protocol is also quite diverse, with maybe 1000+ different ways to interpret the data. I guess the only upside is that the data itself is always numbers. I think at this time these constraints are non-negotiable. Mockup example of what the “inner spec” looks like:

ID: 456 <a=float16, b=int8,..>
ID: 457 <gazelle=uint16, lion=float32,..>

Where ID is found in the header. In case it’s not clear, a=float16 basically means that the first 16 bits of the stream (after the header) are the parameter a, which is a floating-point number. The number of parameters in each row is not the same. I think/hope that all number types are byte-aligned, but I think I can deal with it even if they aren’t, as long as each line is byte-aligned.
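
Parsing the spec text itself should not be the hard part; something naive along these lines (the format above is a mockup, so the regex and type names are just illustrative) gets the name/type pairs per ID:

# Naive sketch of turning one spec line into (name, type) pairs.
const TYPEMAP = Dict("float16" => Float16, "float32" => Float32,
                     "int8" => Int8, "uint16" => UInt16)

function parse_spec_line(line::AbstractString)
    m = match(r"ID:\s*(\d+)\s*<(.*)>", line)
    id = parse(Int, m.captures[1])
    fields = map(split(m.captures[2], ',')) do f
        name, typ = strip.(split(f, '='))
        Symbol(name) => TYPEMAP[typ]
    end
    return id, fields
end

parse_spec_line("ID: 456 <a=float16, b=int8>")  # (456, [:a => Float16, :b => Int8])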

Now, I suppose that the “generate structs” approach is impractical at best. My plan is to download the encountered version of the spec and in the same lazy manner translate each encountered ID into some dynamic “template” for parsing it (obviously caching everything).
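
Roughly like this, i.e. translate on the first encounter and reuse afterwards (fetch_spec_line is a placeholder for wherever the spec text comes from, and parse_spec_line is the sketch above):

const TEMPLATE_CACHE = Dict{Int, Vector{Pair{Symbol, DataType}}}()

function template_for(id::Integer)
    get!(TEMPLATE_CACHE, id) do
        # fetch_spec_line is a placeholder for the download/lookup of the spec text
        _, fields = parse_spec_line(fetch_spec_line(id))
        fields
    end
end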

The narrow version of my question is simply whether there is some format for this template which is better than others. A naïve starting point would be to just use a DataFrame where the text format is used to create column names and their types, and the parsing would be to reinterpret the data based on the column types, but I’m not sure how fast this is compared to the best one can do in this case.

The second part is whether there are any special concerns one would need to take into account to make this work efficiently in a threaded and/or distributed (i.e. a compute cluster) environment, other than putting locks when creating the “templates” (see the sketch below for what I mean). The ambition level for parallelism is multiple streams of data, not parallelizing processing of one single stream, but if there are tips for the latter I’m listening.
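
By locking I just mean something simple along these lines (a sketch only, reusing the hypothetical template_for from above):

const CACHE_LOCK = ReentrantLock()

function template_for_threadsafe(id::Integer)
    # Serialize cache access so two tasks don't create the same template concurrently.
    lock(CACHE_LOCK) do
        template_for(id)
    end
end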

I’m not sure storing the data in a structure is worth the headache; you might just go with a Dict. It really depends on how you are processing the data once you have it. From a multi-threading point of view, is each stream independent of the others, or do they all get merged together?

Thanks for the feedback!

The OP was a bit diffuse, so I understand that it could be difficult to give good advice. I have gone for the dict-cache approach w.r.t. the templates.

Here is an MWE which kind of reflects the current state. The question is how it can be made faster.

struct Template
    dataTypes::Vector{DataType} # Is there a better representation which keeps ts below concrete? StaticArray perhaps?
end

# Parse the "spec", optimising this step is far down on the list of priorities
ts = [Template(rand([UInt8, Int32, Float32], rand(10:100))) for _ in 0:typemax(UInt8)];

julia> isconcretetype(typeof(ts)) # I Think this means that code using ts will be type stable??
true

function decodeit(data, ts)
    GC.@preserve data begin
        p = pointer(data)
        header = unsafe_load(convert(Ptr{UInt8}, p)) + 1
        decodeit(ts[header], p + 1)
    end
end

function decodeit(t::Template, data::Ptr)
    # Naive implementation. How to do this 1) fast and 2) so that the output is type safe?
    offs = 0
    map(Tuple(t.dataTypes)) do dt
        res = unsafe_load(convert(Ptr{dt}, data + offs))
        offs += sizeof(dt)
        return res
    end
end

julia> @code_warntype decodeit(rand(UInt8, 1000), ts);
Variables
  #self#::Core.Const(decodeit)
  data::Vector{UInt8}
  ts::Vector{Template}
  header::UInt8
  p::Ptr{UInt8}

Body::Any
1 ─ %1 = $(Expr(:gc_preserve_begin, :(data)))
│        (p = Main.pointer(data))
│   %3 = Core.apply_type(Main.Ptr, Main.UInt8)::Core.Const(Ptr{UInt8})
│   %4 = Main.convert(%3, p)::Ptr{UInt8}
│        (header = Main.unsafe_load(%4))
│   %6 = Base.getindex(ts, header)::Template
│   %7 = (p + 1)::Ptr{UInt8}
│   %8 = Main.decodeit(%6, %7)::Any
│        $(Expr(:gc_preserve_end, :(%1)))
└──      return %8

julia> @benchmark decodeit(data, $ts) setup=(data=rand(UInt8, 1000))
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  36
  --------------
  minimum time:     3.862 μs (0.00% GC)
  median time:      22.356 μs (0.00% GC)
  mean time:        24.093 μs (0.70% GC)
  maximum time:     355.488 μs (94.95% GC)
  --------------
  samples:          10000
  evals/sample:     8

With a hardcoded template (a struct with a tailored decoding method), decoding seems to be about 1000 times faster than the above (barring any methodology errors), and I’m hoping there is a way to get closer to that speed.
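
For reference, by “hardcoded template” I mean something in this direction (types picked arbitrarily), which the compiler can specialize on completely:

struct Hardcoded456
    a::Float16
    b::Int8
    c::Int32
end

function decode_hardcoded(data::Vector{UInt8})
    # Field types are known at compile time, so this collapses to a few loads.
    GC.@preserve data begin
        p = pointer(data) + 1  # skip the one-byte header as in the MWE
        a = unsafe_load(convert(Ptr{Float16}, p))
        b = unsafe_load(convert(Ptr{Int8}, p + sizeof(Float16)))
        c = unsafe_load(convert(Ptr{Int32}, p + sizeof(Float16) + sizeof(Int8)))
        Hardcoded456(a, b, c)
    end
end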

I could ofc just benchmark my way forward and try different approaches and that is what I’m doing whenever I have some time to spend on this, but it is starting to become a bit of a trial and error and there are many permutations to try out.

I’m toying with the idea of just letting the user (me) supply a callback to collect the data after the headers have been removed, along with the template and the headers. Not sure if this is just going to move the problem in the end though.

I also suppose I could just promote to some large type (I think Float64 should be enough) after decoding and return a vector of that, but since the data is typically sizeable I would like to be able to capitalize on the fact that most types are smaller.

I think that the typical end destination for the data is a DataFrame, and the reason why I mentioned it is that they might have some tricks to deal with the stuff above. I will look into that and CSV to see if there is some pattern there I can borrow.

About the multi-threading: nothing sophisticated like merging streams. It had more to do with race conditions when creating the cache of templates and what happens with it in a distributed setting, but let’s defer that for now.

Fwiw, for now I went with two approaches:

For process-on-the-fly use cases I wrote a semi-ugly macro which wraps an expression to read and process a piece of data, something like this: @process decoded=decodeit(data, template[i], i) result = decoded. It expands to basically putting the above inside an if-else statement over each possible type, which seems to alleviate the type instability.
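
Roughly, the expansion looks like this (only two branches shown for brevity, and decodeit here takes the arguments from the macro example rather than the MWE above):

function process_expanded(data, template, i)
    decoded = decodeit(data, template[i], i)
    # Manual union splitting: the isa checks let the compiler specialize the
    # user expression ("result = decoded" here) on a concrete tuple type.
    if decoded isa Tuple{Float16, Int8}
        result = decoded
    elseif decoded isa Tuple{UInt16, Float32}
        result = decoded
    else
        result = decoded  # dynamic fallback
    end
    return result
end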

For the case when one just wants to extract all data, I simply deferred the reinterpretation, so that the returned result is a struct wrapping the raw bytes for each ID along with the type metadata.
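
Roughly like this (names made up):

# Keep the raw payload bytes together with the template and only
# reinterpret when a column is actually asked for.
struct LazyRecords
    template::Template
    payloads::Vector{Vector{UInt8}}   # raw bytes for each packet with this ID
end

# Reinterpret field number `idx` of every payload on demand.
function column(r::LazyRecords, idx::Integer)
    dt = r.template.dataTypes[idx]
    offs = sum(sizeof.(r.template.dataTypes[1:idx-1]); init=0)
    return [reinterpret(dt, p[offs+1:offs+sizeof(dt)])[1] for p in r.payloads]
end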