I’m in the process of trying to implement a fast-ish protocol parser/reader for what appears to be an ad-hoc protocol. The use case, for context, is to speed up a ‘start it before lunch and maybe it’s done after lunch’ task into something a bit more interactive, say 5-10 minutes, and despite the complications listed below I think there is plenty of opportunity to achieve this.
The protocol itself is defined in a parseable text format and the data itself seems to be in some raw-bytes format. The headers were pretty straightforward: I could just generate Julia structs from the protocol spec and reinterpret the data stream, and it seems to be quite fast (about 10 ns per packet), similar to what is done here. In case it matters, the primary use case is for when the data has been dumped to a file and then reading that file, which I guess is a little bit different from traditional networking cases.
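To make the header part concrete, the reinterpret trick I mean looks roughly like this (the field names, sizes and layout here are made up for illustration; the real struct would be generated from the spec text):

```julia
# Hypothetical fixed-size header -- the real fields come from the spec.
# Ordered to avoid padding so sizeof(Header) matches the wire size exactly.
struct Header
    id::UInt16      # payload row ID, e.g. 456
    len::UInt16     # payload length in bytes
    version::UInt32 # payload spec version
end

# 8 raw bytes as they might appear in the dump (little-endian)
bytes = UInt8[0xc8, 0x01, 0x08, 0x00, 0x01, 0x00, 0x00, 0x00]
hdr = only(reinterpret(Header, bytes))  # hdr.id == 456
```

This is allocation-free apart from the slice and only works because the struct is `isbits` with no padding, which is part of why it clocks in around 10 ns per packet.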
However, the next step poses a bit of a challenge: from what I understand, the headers contain a version number for the spec of the payload protocol, this version changes very frequently (daily), and every revision is generally breaking. This protocol is also quite diverse, with maybe 1000+ different ways to interpret the data. I guess the only upside is that the data itself is always numbers. I think these constraints are non-negotiable at this time. Mockup example of what the “inner spec” looks like:
```
ID: 456 <a=float16, b=int8, ..>
ID: 457 <gazelle=uint16, lion=float32, ..>
```
ID is found in the header. In case it’s not clear, `a=float16` means basically that the first 16 bits of the stream (after the header) are the parameter `a`, which is a floating point number. The number of parameters in each row is not the same. I think/hope that all number types are byte-aligned, but I think I can deal with it even if they aren’t, as long as each line is byte-aligned.
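For what it’s worth, turning one such spec row into name/type pairs could look something like this (the type-name mapping is my guess from the mockup, and the regex assumes exactly the `ID: nnn <...>` syntax shown above):

```julia
# Map spec type names to Julia types -- the names here are guesses
# based on the mockup; extend as the real spec dictates.
const TYPEMAP = Dict(
    "float16" => Float16, "float32" => Float32, "float64" => Float64,
    "int8" => Int8, "int16" => Int16, "int32" => Int32, "int64" => Int64,
    "uint8" => UInt8, "uint16" => UInt16, "uint32" => UInt32, "uint64" => UInt64,
)

# Parse e.g. "ID: 456 <a=float16, b=int8>" into (456, [:a => Float16, :b => Int8])
function parsespecrow(line::AbstractString)
    m = match(r"ID:\s*(\d+)\s*<(.*)>", line)
    id = parse(Int, m.captures[1])
    fields = map(split(m.captures[2], ','; keepempty=false)) do f
        name, typ = strip.(split(f, '='))
        Symbol(name) => TYPEMAP[String(typ)]
    end
    return id, fields
end

id, fields = parsespecrow("ID: 456 <a=float16, b=int8>")
```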
Now, I suppose the “generate structs” approach is impractical at best. My plan is to download the encountered version of the spec and, in the same lazy manner, translate each encountered ID into some dynamic “template” for parsing it (obviously caching everything).
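The lazy part could be as simple as a Dict keyed on (version, ID) with a `get!` that builds the template on a miss (`build` here is a hypothetical stand-in for downloading and parsing the spec text for that version):

```julia
# Cache of parse "templates" keyed by (spec version, row ID).
# `build` stands in for downloading + parsing the spec text;
# it only runs on a cache miss.
const TEMPLATES = Dict{Tuple{Int,Int},Vector{Pair{Symbol,DataType}}}()

gettemplate!(build, version::Int, id::Int) =
    get!(() -> build(version, id), TEMPLATES, (version, id))

# Usage: the first call builds, later calls for the same key hit the cache.
tmpl = gettemplate!(20240101, 456) do version, id
    [:a => Float16, :b => Int8]  # would come from the parsed spec text
end
```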
The narrow version of my question is simply whether some format for this template is better than others. A naïve starting point would be to just use a DataFrame, where the text format is used to create the column names and their types, and the parsing would be to reinterpret the data based on the column types, but I’m not sure how fast this is compared to the best one can do in this case.
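As a point of comparison to the DataFrame idea, the template could also just be the name/type pairs themselves, decoded into a NamedTuple per packet. A sketch, assuming byte-aligned fields stored in native (little-endian) order:

```julia
# Decode one payload according to a template of name => type pairs.
# Assumes every field is byte-aligned and in native little-endian order.
function decode(template::Vector{Pair{Symbol,DataType}}, payload::Vector{UInt8})
    offset = 1
    vals = Pair{Symbol,Any}[]
    for (name, T) in template
        push!(vals, name => only(reinterpret(T, payload[offset:offset+sizeof(T)-1])))
        offset += sizeof(T)
    end
    return (; vals...)  # e.g. (a = Float16(1.0), b = 7)
end

tmpl = [:a => Float16, :b => Int8]
payload = UInt8[0x00, 0x3c, 0x07]  # Float16(1.0) is 0x3c00 little-endian, then Int8(7)
row = decode(tmpl, payload)
```

The `Any`-typed pairs make this type-unstable, so it won’t match generated structs; whether that matters compared to DataFrame column reinterpretation is exactly what I’d want to benchmark.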
The second part is whether there is any special concern one would need to take to make this work efficiently in a threaded and/or distributed (i.e. a compute cluster) environment, other than taking locks when creating the “templates”. The ambition level for parallelism is multiple streams of data, not parallelizing the processing of one single stream, but if there are tips for the latter I’m listening.
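Concretely, the locking scheme I had in mind is a single lock around the shared cache and one task per stream; a minimal sketch, assuming one lock suffices (`process` is a placeholder for the per-stream parse loop):

```julia
# Shared template cache guarded by a lock; `build` (which might involve a
# download) runs at most once per ID, while other tasks wait on the lock.
const CACHE = Dict{Int,Vector{Pair{Symbol,DataType}}}()
const CACHE_LOCK = ReentrantLock()

cachedtemplate!(build, id::Int) =
    lock(CACHE_LOCK) do
        get!(() -> build(id), CACHE, id)
    end

# One task per stream; start Julia with `-t N` for actual parallelism.
# `process` is a placeholder for whatever parses one stream.
parse_streams(process, streams) =
    foreach(wait, [Threads.@spawn process(s) for s in streams])
```

One downside of this sketch is that a slow `build` (e.g. a spec download) blocks every other stream that needs any template; splitting into a per-ID lock or building outside the lock would avoid that, at the cost of possibly building the same template twice.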