Generating type specific deserialisers for BSON.jl

This is more of a blog post than a forum post, but I don’t maintain a
blog so…

I have spent quite some time trying to optimise the excellent
BSON.jl library as we make heavy use
of it. The result so far is some updates to the original library and a
fork called BSONqs, which has more
radical changes (more on that at the bottom).

I like working with Julia’s structs and organise most of my data with
them. Even if there is no need to do so for performance reasons, and it
introduces some rigidity, I still prefer them over working with generic data
structures like Dicts or tables. I have a bad memory, so I will forget what
properties the data has if they are not written down somewhere.

After trying a few different methods for serialising arbitrary Julia
structs in a reliable way, I found that BSON.jl worked best for what I am
doing. I won’t go into any more detail on that, because at the time many
libraries were simply broken on Julia 1.0 or didn’t exist.

We store several GBs of very boring Linux test results in Redis and load large
chunks of it into RAM. This can take up to 30s and a lot of that time is spent
by BSON.jl. So I decided to try optimising its read performance as much as
possible.

Like most (de)serialisers, BSON.jl first parses the raw BSON document into an
intermediate format, which originally was a set of nested Dicts. The
dictionaries mirror the raw BSON very closely; below you can see how that
looks.

julia> struct Foo
       bar::Set{String}
       end

julia> io = IOBuffer();

julia> BSON.bson(io, Dict(:foo => Foo(Set(["baz"]))))

julia> seek(io, 0);

julia> BSON.parse(io)
Dict{Symbol,Any} with 1 entry:
  :foo => Dict{Symbol,Any}(:tag=>"struct",:type=>Dict{Symbol,Any}(:tag=>"datatype",:params=>Any[],:name=>Any["Main", "Foo"]),:data=…

julia> ans[:foo]
Dict{Symbol,Any} with 3 entries:
  :tag  => "struct"
  :type => Dict{Symbol,Any}(:tag=>"datatype",:params=>Any[],:name=>Any["Main", "Foo"])
  :data => Any[Dict{Symbol,Any}(:tag=>"struct",:type=>Dict{Symbol,Any}(:tag=>"datatype",:params=>Any[Dict{Symbol,Any}(:tag=>"dataty…

After a few basic optimisations, the profiler showed that a lot of time was
being spent simply allocating dictionaries for the intermediate format. From
the output above you can see the dictionaries have very few members, and that
these are always the same for serialised Julia data structures.
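
If you want to see this for yourself, a quick check is to repeat the parse
from above under @time and look at the allocation count (on a second run, to
exclude compilation):

julia> seek(io, 0);

julia> @time BSON.parse(io);  # the allocation count is dominated by the Dict tree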

So instead of using Dicts, I defined some structs for use in the intermediate
layer.

"""Type class which represents a tagged dictionary

Tagged dictionaries are used to represent complex Julia types. Using a struct
instead of an actual Dictionary requires less memory allocation and allows us
to use multiple dispatch on the resulting tree structure.

It inherits AbstractDict just for show."""
abstract type Tagged <: AbstractDict{Symbol, Any} end

"Type class for types which can occupuy the 'type' field in a struct"
abstract type TaggedStructType <: Tagged end

...

mutable struct TaggedType <: TaggedStructType
  name::Vector{String}
  params::Vector{TaggedParam}
end

mutable struct TaggedStruct <: TaggedStructType
  ttype::TaggedStructType
  data::BSONArray
end

...
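
As the docstring says, the point of the tree of structs is that later stages
can use multiple dispatch instead of branching on a :tag string. Here is a
self-contained toy version of that idea (the names are illustrative, not the
actual BSONqs types):

abstract type Node end

struct IntNode <: Node
  value::Int64
end

struct ListNode <: Node
  items::Vector{Node}
end

# Dispatch selects the right handler per node type, replacing
# `if d[:tag] == "struct" ...` style branching on Dict contents.
raise(n::IntNode) = n.value
raise(n::ListNode) = Any[raise(i) for i in n.items]

raise(ListNode([IntNode(1), ListNode([IntNode(2)])]))  # => Any[1, Any[2]]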

This struct-based type system turned out annoying and complex (maybe
necessarily, maybe not), but saved a huge amount of memory. Initially the
performance was worse, which was pretty upsetting, but I figured out this was
because I had increased the number of type-unstable functions, resulting in
more dynamic dispatch. After fixing some of that, I got a 100% speedup.

At this point I decided to go in a completely different direction and try to
cut out the intermediate layer altogether. This basically means we now have
three BSON parsers:

  1. Using Dicts for the intermediate layer (in BSON.jl)
  2. Using dedicated types for the intermediate layer (in BSONqs.jl with load_compat)
  3. No intermediate layer and type specific parsing (in BSONqs.jl with load)
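
For reference, the entry points named above are used roughly like this (I am
assuming a file path argument here, matching BSON.jl’s load; check the BSONqs
README for the definitive API):

using BSONqs

d  = BSONqs.load("results.bson")         # parser 3: type specific, no intermediate layer
d2 = BSONqs.load_compat("results.bson")  # parser 2: dedicated intermediate types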

One could argue that 3. still has an intermediate data structure, but that it
is allocated on the stack in the form of a finite state machine. The wisdom of
removing the intermediate layer is debatable, as it makes a lot of things
harder to debug and reason about. The intermediate layer may also have better
cache efficiency.

On the other hand, with no intermediate layer we can copy data directly from
the input stream/buffer to the native Julia data structures. In theory we
could even make it zero-copy, if Julia’s memory system allows it. This could
be extremely useful when parsing massive, contiguous arrays of numerical data
from a memory-mapped file, as we would not even need to read most of the data
until it is used in a computation.
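
As a minimal sketch of what copying directly means (this is not the actual
BSONqs code, and it ignores the BSON framing entirely): once the element type
and count are known, the bytes can be read straight into a native array in one
call.

# Sketch only: bulk-read 1000 Int64s from a buffer into a native array,
# with no per-element parsing and no intermediate tree.
io = IOBuffer(rand(UInt8, 8000))
v = Vector{Int64}(undef, 1000)
read!(io, v)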

The third parser is by far the quickest for large sets of structs. However,
while I have differentiated the parsers by their intermediate layer (or lack
thereof), the third parser is also distinguished by being more aware of the
resultant Julia type of the data being parsed.

In many cases we know in advance what type the data we are parsing should
be. For example, take the following structure.

struct Foo
  bar::Vector{Int}
end

bson(io, Dict(:foo => Foo(rand(Int, 1000))))

When we come to parse the :foo entry, we will have to determine its type
(which is Foo) first. After that point, however, we can switch to a parser
generated specifically to parse Foo. For Foo’s only member, bar, we don’t even
need to parse its type, because we already know what it is from the containing
struct’s field definition. In theory we can call a generated method which does
the bare minimum necessary for parsing a Vector{Int} and is type stable.
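
Such a method could look something like the sketch below. This is my
simplification, not the real wire format (BSON stores such arrays with their
own headers); it just shows the type-stable shape such a method can take:

# Hypothetical simplification: a type-stable reader that takes a length
# prefix and bulk-copies the elements.
function read_int_vector(io::IO)::Vector{Int64}
  n = read(io, Int32)          # length prefix (simplified framing)
  v = Vector{Int64}(undef, n)
  read!(io, v)                 # one bulk copy, no per-element dispatch
  return v
end

io = IOBuffer()
write(io, Int32(3)); write(io, Int64[1, 2, 3]); seek(io, 0)
read_int_vector(io)  # => [1, 2, 3]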

In practice, this is pretty close to how the third parser actually works,
although there is definitely room for improvement. It achieves this through
two strategies. Firstly, it uses multiple dispatch on concrete types, allowing
the Julia compiler to work its magic and create type-specific code from
generic methods.
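
The first strategy needs no special machinery at all; the compiler specialises
any generic method for each concrete argument type. For instance (toy code,
not the library’s):

# One generic method; Julia compiles a separate, type-stable specialisation
# for each concrete T it is called with.
read_fixed(io::IO, ::Type{T}) where {T} = read(io, T)

io = IOBuffer(); write(io, 1.5); write(io, Int32(7)); seek(io, 0)
read_fixed(io, Float64)  # specialised for Float64 => 1.5
read_fixed(io, Int32)    # specialised for Int32   => 7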

In places where that doesn’t work, it uses @generated methods, which allow
us to explicitly generate code based on the types of the method arguments.
First, a toy example of the mechanics, then the real thing; I doubt the real
code will make much sense taken out of context (or in context for that
matter), but hopefully it can give you an idea of what this looks like.
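
A toy @generated function (mine, not from the library): the body runs with
access to the argument *types*, and the expression it returns becomes the
compiled method body.

# Toy example: inside an @generated body, `x` is bound to the *type* of the
# argument, not its value.
@generated function nfields_doubled(x)
  n = fieldcount(x)   # computed once at compile time, per concrete type
  return :($n * 2)    # this expression becomes the method body
end

struct Point
  a::Int
  b::Int
end

nfields_doubled(Point(1, 2))  # => 4, with no runtime reflection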

# This is called when we know what type 'T' we want. There are many
# definitions of parse_specific, this one is the most generic and so
# is called when all the others have failed to match. The assumption here is
# that T will be a struct type (if it is concrete) as all other types will
# match with a more specific parse_specific method (say what you like about my
# naming)
function parse_specific(io::IO, ::Type{T}, tag::BSONType,
                        ctx::ParseCtx)::T where T

  # structs are always represented by BSON documents
  # note that this runs normally at 'runtime' before any generated code or the
  # fallback
  @asserteq tag document

  # This tells the compiler this is a generated method. The code in the first
  # branch of this if statement is run at/before 'compile time' and returns an
  # AST (expression). The AST is then compiled, the same as for macros. The
  # difference is that macros have no access to type information.
  if @generated
    # If it is not a concrete type we fall back to a method which will first load
    # the embedded type information from the BSON document. This will happen
    # if (for example) you put abstract or union types in your struct definitions
    if !isconcretetype(T)
      return :(parse_tag(io, tag, ctx))
    end

    # If it is a concrete struct, this chunk of code looks through the BSON
    # document for the members we expect in a serialised struct (or in a
    # reference to a struct) and saves pointers to what it finds. This could
    # be simplified if we could guarantee the order the elements appear in;
    # I am being cautious for now
    quote
      startpos = position(io)
      len = read(io, Int32)
      local data::BSONElem
      ref = nothing

      for _ in 1:4
        if (tag = read(io, BSONType)) == eof
          break
        end

        k = parse_cstr_unsafe(io)
        if k == b"tag"
          @asserteq tag string
        elseif k == b"ref"
          ref = BSONElem(tag, io)
        elseif k == b"type"
          @asserteq tag document
        elseif k == b"data"
          @asserteq tag array
          data = BSONElem(tag, io)
        end

        skip_over(io, tag)
      end

      endpos = position(io)
      @asserteq (startpos + len) endpos

      # Objects which appear in the data more than once are lowered into a special
      # array and their instances replaced with indexes/references into that
      # array before serialisation to BSON.
      if ref ≠ nothing
        # Follow the reference into the array to get/parse the actual struct
        ret = parse_specific_ref(io, $T, ref, ctx)::$T
        seek(io, endpos)
        return ret
      end

      seek(io, data.pos)
      # This is where most of the work is done to reconstitute the struct T
      ret = load_struct(io, $T, data.tag, ctx)
      seek(io, endpos)
      ret
    end
  else
    # Fallback code to use if the compiler decides not to use the
    # generated version
    parse_tag(io, tag, ctx)::T
  end
end

...

function load_struct(io::IO, ::Type{T}, dtag::BSONType, ctx::ParseCtx)::T where T
  if @generated

    # This time we have 4 alternative code blocks which may be produced
    # depending on the type of struct. Firstly we have some code to build
    # C-style structs with a straightforward memory layout. These structs
    # only contain bits types themselves (like Int, Float64, ComplexF64).
    if isprimitive(T)
      @assert isbitstype(T)
      quote
        @asserteq dtag binary
        bits = parse_bin_unsafe(io, ctx)
        # This calls a C function in the Julia runtime which copies 'bits'
        # directly into a new Julia struct.
        ccall(:jl_new_bits, Any, (Any, Ptr{Cvoid}), $T, bits)
      end
    elseif T <: Dict
      # Dictionaries have their own special representation in BSON
      :(load_dict!(io, $T(), ctx))
    elseif fieldcount(T) < 1
      :($T())
    else
      # Now we have to deal with the type of struct which has an
      # unknown memory layout and we have to build one field at a time.
      n = fieldcount(T)
      @assert n > 0
      FT = fieldtype(T, 1)

      block = :(x = ccall(:jl_new_struct_uninit, Any, (Any,), $T);
                setref(x, ctx);
                parse_array_len(io, ctx);
                tag = parse_array_tag(io, ctx);
                f = parse_specific(io, $FT, tag, ctx)::$FT;
                ccall(:jl_set_nth_field, Nothing, (Any, Csize_t, Any), x, 0, f))
      for i in 2:n
        # This loop runs at compile time; we concatenate the previous contents
        # of 'block' with the current field expression. There will be no loop
        # in the final code, just a list of type-specific calls to parse and
        # set each field.

        FT = fieldtype(T, i)
        block = :($block;
                  tag = parse_array_tag(io, ctx);
                  f = parse_specific(io, $FT, tag, ctx)::$FT;
                  ccall(:jl_set_nth_field, Nothing, (Any, Csize_t, Any), x, $i-1, f))
      end

      :($block; x)
    end
  else
    # Fallback code omitted; it looks similar to the above but without expressions
    ...
  end
end
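
Taken in isolation, the jl_new_bits trick from the first branch looks like
this (a standalone demo; the pointer here stands in for the pointer into the
BSON buffer):

struct P
  a::Int64
  b::Int64
end

buf = Int64[3, 4]  # pretend these 16 bytes came from the BSON stream
x = GC.@preserve buf ccall(:jl_new_bits, Any, (Any, Ptr{Cvoid}), P, pointer(buf))
x  # => P(3, 4)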

Overall, the third parser gives a ~3.5x performance improvement in my
benchmarks; in the best case, with only concrete types, it would probably be
~4x. However, this is on the second run with a large data set. On the first
run, with any dataset which is not huge, compilation time can easily come to
dominate, and the third parser takes much longer to compile, probably because
it must generate far more type-specific code.

I am not sure exactly what to do next with this, if anything; it would clearly
benefit from some cleanup, but who has time for that? I have forked the
package because I have practically rewritten most of the library and changed
its behaviour (although it is still mostly compatible). Some of these
improvements can definitely be added to the original (with effort); for others
it is not so clear whether they are wanted, as no one has shown an interest in
them.

There is nothing strictly binding me to this particular BSON format either,
and there is probably something intrinsically more efficient; for a start, a
lot of the type information could be omitted. On the other hand, I’d like to
keep up the pretence of being compatible with something that has an existing
user base.
