Serialize() is prohibitively slow for recursive structs. What am I missing?

I am representing some data with a large number of tree structs:

mutable struct MinimalNode
    childvalues::Union{NTuple{4, Int8}, NTuple{3, Int8}, NTuple{2, Int8}, NTuple{1, Int8}, Tuple{}};
    childnodes::Union{NTuple{4, MinimalNode}, NTuple{3, MinimalNode}, NTuple{2, MinimalNode}, NTuple{1, MinimalNode}, Tuple{}};
end

I need to be very parsimonious in order to keep it in memory (I will eventually have ~2M trees of size ~400), hence the Int8s and explicit types in there. Each tree is about 10kB according to Base.summarysize().

Serializing 21580 of these trees in a vector takes 30s. The resulting .jls file is about 90MB. I know from experience that Julia can (de)serialize 90MB files in an instant.

I have two guesses for what the problem is:

  • The Int8s
  • Each tree has a slightly different shape, so allocating memory cannot be done in advance

but I’m not sure how to solve either one. Maybe there’s an aspect of the serializer or the type system or something that I haven’t sussed out yet, that can help me make this fast?

It would be helpful if you could provide a minimal working example that contains:

  1. the type definition (as above)
  2. the code to create a tree of the desired size, filled with random data
  3. a function that serializes 21580 of these trees
  4. a call of this function using @btime from BenchmarkTools.jl

I know MWEs are useful. In this case, creating random data that has the same structure as my actual data is not so easy. I will work on it anyway, but in the meantime here is the actual data: test.jls

It was serialized with Julia 1.7.1 and takes me 28s to deserialize on my machine, with using Serialization; deserialize("test.jls")

I note that you ask me to use @btime rather than just @time – which suggests you think the slowness is due to compilation time, right? So you think it is my second guess? Unfortunately, this is an action that will be run once after Julia is first started, so I cannot take advantage of compile times on subsequent uses. At any rate, the current time for first deserialization will be measured in days which is unacceptable.

Edit to add: @btime returns 24.913 s (45550300 allocations: 1.99 GiB) so it’s not compilation time after all.

Can you profile this with Profile.@profile and post the results of Profile.print() (via a pastebin)? Make sure to only post the results after a second run, after clearing the profile buffer with Profile.clear() and re-running Profile.@profile to eliminate overhead from compilation.
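A minimal sketch of that workflow (the file name profile_demo.jls and the warm-up data are placeholders, since test.jls itself isn’t reproduced here):

```julia
using Profile, Serialization

# Stand-in file so the sketch is self-contained
serialize("profile_demo.jls", [rand(Int8, 100) for _ in 1:1000])

deserialize("profile_demo.jls")                            # warm-up run: triggers compilation
Profile.clear()                                            # discard any samples collected so far
trees = Profile.@profile deserialize("profile_demo.jls")   # measurement run
Profile.print()                                            # copy this output to a pastebin
rm("profile_demo.jls")
```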

Sure, here you go: https://pastebin.com/M4g06vtc

I am asked for a password when clicking on this link …

Sorry, my first time using pastebin. Try again, please.

Now the link works… 🙂

OK, here is an MWE. (A previous version had some randomness in the structure; that randomness is not essential.)

The nodes generated by addbranches! are about 25kB in size, and a vector of 10,000 of them takes about 30s to serialize on my machine.

mutable struct MinimalNode
    childvalues::Union{NTuple{4, Int8}, NTuple{3, Int8}, NTuple{2, Int8}, NTuple{1, Int8}, Tuple{}};
    childnodes::Union{NTuple{4, MinimalNode}, NTuple{3, MinimalNode}, NTuple{2, MinimalNode}, NTuple{1, MinimalNode}, Tuple{}};
end

function addbranches!(mn::MinimalNode, aleft::Int64, bleft::Int64, parity::Int64, depth = 0)

    (depth == 12) && (return nothing);

    if parity == 1
        (aleft <= 1) && (return nothing);
        mn.childvalues = Tuple(Int8.(1:aleft));  # Int8 to match the declared field type
        mn.childnodes = Tuple([MinimalNode((), ()) for ii in 1:aleft]);
        aleft -= 1;
    else
        (bleft <= 1) && (return nothing);
        mn.childvalues = Tuple(Int8.(1:bleft));  # Int8 to match the declared field type
        mn.childnodes = Tuple([MinimalNode((), ()) for ii in 1:bleft]);
        bleft -= 1;
    end

    addbranches!.(mn.childnodes, aleft, bleft, 3-parity, depth + 1);
    
end

function buildfaketree()
    root = MinimalNode((), ());
    addbranches!(root, 4, 4, 1, 0);
    return root;
end


MN = [buildfaketree() for ii in 1:10000];
using Serialization;
@time serialize("test.jls", MN);

And just in case it’s useful, here is the flame graph from ProfileView.@profview serialize("test.jls", MN):

All of the red bars are serialize_any calls. I chose one at random; here is the warntype_last() output:

MethodInstance for Serialization.serialize_any(::Serializer{IOStream}, ::Any)
  from serialize_any(s::AbstractSerializer, x) in Serialization at C:\Users\rNr\AppData\Local\Programs\Julia-1.7.1\share\julia\stdlib\v1.7\Serialization\src\Serialization.jl:655
Arguments
  #self#::Core.Const(Serialization.serialize_any)
  s::Serializer{IOStream}
  x::Any
Locals
  @_4::Union{Nothing, Tuple{Int64, Int64}}
  nf::Int64
  t::DataType
  tag::Int32
  i::Int64
Body::Nothing
1 ──       nothing
│          Core.NewvarNode(:(@_4))
│          Core.NewvarNode(:(nf))
│          Core.NewvarNode(:(t))
│          (tag = Serialization.sertag(x))
│    %6  = (tag > 0)::Bool
└───       goto #3 if not %6
2 ── %8  = Base.getproperty(s, :io)::IOStream
│    %9  = Serialization.write_as_tag(%8, tag)::Core.Const(nothing)
⋮          (output truncated)
│    %38 = Core.getfield(%36, 2)::Int64
│    %39 = Serialization.isdefined(x, i)::Bool
└───       goto #14 if not %39
13 ─ %41 = Serialization.getfield(x, i)::Any
│          Serialization.serialize(s, %41)
└───       goto #15
14 ─ %44 = Base.getproperty(s, :io)::IOStream
└───       Serialization.writetag(%44, Serialization.UNDEFREF_TAG)
15 ┄       (@_4 = Base.iterate(%31, %38))
│    %47 = (@_4 === nothing)::Bool
│    %48 = Base.not_int(%47)::Bool
└───       goto #17 if not %48
16 ─       goto #12
17 ┄       return Serialization.nothing

The only red items in the actual output are the argument x::Any, the local t::DataType, and the line %41 = Serialization.getfield(x, i)::Any.


Yeah, that’s what I figured might be happening: you’re hitting the slow dynamic dispatch path. What you probably want to do for deserialization is define a custom deserialize method that will decode your nested structs until you hit some other kind of object (or it terminates). It’s not particularly easy to write these, but you could see https://github.com/JuliaData/MemPool.jl/blob/d6e92519259df3e5cda57ee515b5b5bd6f69ddcf/src/datastore.jl#L31-L41 for an example. You can do something similar for serialization, like in https://github.com/JuliaData/MemPool.jl/blob/d6e92519259df3e5cda57ee515b5b5bd6f69ddcf/src/datastore.jl#L19-L22.
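For this thread’s type, such a pair might look roughly like the sketch below. It is untested against the real data, assumes a simple length-prefixed byte encoding, and (deliberately) skips the cycle/identity tracking of the generic path, which is fine for trees but not for object graphs with shared nodes; the struct definition is repeated so the snippet is self-contained:

```julia
using Serialization

mutable struct MinimalNode  # repeated from above so the sketch is self-contained
    childvalues::Union{NTuple{4, Int8}, NTuple{3, Int8}, NTuple{2, Int8}, NTuple{1, Int8}, Tuple{}}
    childnodes::Union{NTuple{4, MinimalNode}, NTuple{3, MinimalNode}, NTuple{2, MinimalNode}, NTuple{1, MinimalNode}, Tuple{}}
end

function Serialization.serialize(s::AbstractSerializer, n::MinimalNode)
    Serialization.serialize_type(s, MinimalNode)   # write the type tag so deserialize can dispatch
    write(s.io, UInt8(length(n.childvalues)))      # one count byte, then the raw Int8 values
    for v in n.childvalues
        write(s.io, v)
    end
    write(s.io, UInt8(length(n.childnodes)))       # one count byte, then each child subtree
    for c in n.childnodes
        serialize(s, c)                            # recurses back into this method
    end
    nothing
end

function Serialization.deserialize(s::AbstractSerializer, ::Type{MinimalNode})
    nv = read(s.io, UInt8)
    vals = Tuple(read(s.io, Int8) for _ in 1:nv)
    nc = read(s.io, UInt8)
    children = Tuple(deserialize(s)::MinimalNode for _ in 1:nc)
    return MinimalNode(vals, children)
end
```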

Aside: why is this slow by default? Because serialize_any and friends use @nospecialize annotations to prevent excessive compilation when doing ser/des. This makes messages with fewer large objects very efficient, but also means that many smaller objects are much more taxing, due to the dynamic dispatch overhead. It’s a tradeoff that was likely intentionally chosen to ensure decent performance in the common case (since most usages of serialization are likely serializing very large but simple objects).
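The tradeoff is easy to see with a toy comparison (the Tiny type and the sizes here are made up for illustration):

```julia
using Serialization

mutable struct Tiny
    x::Int8
end

many = [Tiny(Int8(i % 100)) for i in 1:100_000]  # many small mutable objects
flat = rand(Int8, 100_000)                       # one large bitstype array

# Warm up so compilation is excluded from both timings
serialize(IOBuffer(), many[1:10]); serialize(IOBuffer(), flat[1:10])

t_many = @elapsed serialize(IOBuffer(), many)
t_flat = @elapsed serialize(IOBuffer(), flat)
# t_many is typically orders of magnitude larger: every Tiny goes through
# the generic, dynamically dispatched serialize_any path (plus cycle
# tracking, since Tiny is mutable), while the Int8 array is one bulk write.
```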


Thanks very much for the info! I already did a quick search on custom serialization and came up empty, so your pointer to an existing example is much appreciated.

I have a different analysis. The vast majority of time appears to be spent in the serialize_cycle routine, and while almost all time is spent in functions with dynamic dispatch, the overhead of those functions (i.e. the cost of dynamic dispatch itself) appears to be low. We can see this because the bars above the red bars occupy almost their entire width.

Further, a large portion of the overhead appears to be in evaluating IdDict’s get, which is used for cycle tracking during serialization.

I suspect that the issue is that you are serializing a data structure that could theoretically (although in your case doesn’t) contain pointer cycles, and serialize is spending 99% of its time checking for cycles.
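The cycle checking isn’t gratuitous, to be clear: for mutable types, serialize must preserve object identity (and survive genuine reference cycles), which is what the IdDict lookups pay for. A small illustration:

```julia
using Serialization

mutable struct Cell
    x::Int
end

a = Cell(1)
v = [a, a]              # the same object appears twice

io = IOBuffer()
serialize(io, v)
seekstart(io)
w = deserialize(io)

w[1] === w[2]           # identity is preserved across the round trip, which
                        # requires checking every mutable object against the
                        # back-reference table
```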

The following addendum speeds things up 5×, but is still substantially slower than serializing a plain vector:

struct ImmutableMinimalNode2
    childvalues::Union{NTuple{4, Int8}, NTuple{3, Int8}, NTuple{2, Int8}, NTuple{1, Int8}, Tuple{}};
    childnodes::Union{NTuple{4, ImmutableMinimalNode2}, NTuple{3, ImmutableMinimalNode2}, NTuple{2, ImmutableMinimalNode2}, NTuple{1, ImmutableMinimalNode2}, Tuple{}};
end
function make_immutable(node::MinimalNode)
    ImmutableMinimalNode2(node.childvalues, make_immutable.(node.childnodes))
end

immutable_MN = make_immutable.(MN)

@elapsed serialize("immutable_test.jls", immutable_MN)

I’ve explained and eliminated 80% of the runtime, but there is still a ways to go.


Oh, brilliant idea just to have a single make_immutable pass after building the tree! This is a write-once, read-many situation so that is a very easy improvement I can make.

I spent a couple of hours last night working through Serialization.jl with an eye toward a custom dispatch for serialize. In the meantime, it was easy enough to flatten the tree into a vector of Ints representing all the childvalues instances and serialize that instead. Even with the overhead of flattening and expanding, I’ve already gotten a huge increase in throughput.
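The exact flattening code isn’t shown; under the MWE’s property that childvalues and childnodes always have the same length, a sketch might look like the following (the flatten!/expand! names are made up, and the struct is repeated for self-containedness):

```julia
# Repeated from the MWE so this sketch is self-contained
mutable struct MinimalNode
    childvalues::Union{NTuple{4, Int8}, NTuple{3, Int8}, NTuple{2, Int8}, NTuple{1, Int8}, Tuple{}}
    childnodes::Union{NTuple{4, MinimalNode}, NTuple{3, MinimalNode}, NTuple{2, MinimalNode}, NTuple{1, MinimalNode}, Tuple{}}
end

# Pre-order encoding: one count byte, then the child values, then each
# child subtree. Assumes length(childvalues) == length(childnodes),
# as in the MWE above.
function flatten!(out::Vector{Int8}, n::MinimalNode)
    push!(out, Int8(length(n.childvalues)))
    append!(out, n.childvalues)
    foreach(c -> flatten!(out, c), n.childnodes)
    return out
end

function expand!(flat::Vector{Int8}, pos::Ref{Int} = Ref(1))
    k = Int(flat[pos[]]); pos[] += 1
    vals = Tuple(flat[pos[] + i - 1] for i in 1:k); pos[] += k
    children = Tuple(expand!(flat, pos) for _ in 1:k)  # advances pos per child
    return MinimalNode(vals, children)
end
```

With this, only a Vector{Int8} per tree ever reaches serialize, which takes the fast bulk-write path instead of walking each node.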


I’ve made https://github.com/JuliaLang/julia/issues/44175 to track this issue.


This has a (not so?) surprising similarity to the transformation needed to preallocate vectors as a backing store for node-based containers like lists and trees.
