I need to be very parsimonious in order to keep everything in memory (I will eventually have ~2M trees of size ~400), hence the `Int8`s and explicit types in the definition. Each tree is about 10 kB according to `Base.summarysize()`.
Serializing 21,580 of these trees in a vector takes 30 s. The resulting `.jls` file is about 90 MB, and I know from experience that Julia can (de)serialize 90 MB files in an instant.
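Something along these lines (a sketch with illustrative names, not my actual definition):

```julia
using Serialization

# Hypothetical stand-in for the real type: small Int8 payloads per node,
# and a variable number of children, so every tree has a different shape.
mutable struct Node
    childvalues::Vector{Int8}
    children::Vector{Node}
end

# trees = Vector{Node}(...)        # 21,580 trees in my case
# serialize("test.jls", trees)     # takes ~30 s
```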
I have two guesses for what the problem is:

1. the `Int8`s, or
2. each tree has a slightly different shape, so memory cannot be allocated in advance,

but I’m not sure how to solve either one. Maybe there’s an aspect of the serializer or the type system that I haven’t sussed out yet that could help me make this fast?
I know MWEs are useful. In this case, creating random data that has the same structure as my actual data is not so easy. I will work on it anyway, but in the meantime here is the actual data: test.jls
It was serialized with Julia 1.7.1 and takes 28 s to deserialize on my machine, with `using Serialization; deserialize("test.jls")`.
I note that you ask me to use `@btime` rather than just `@time`, which suggests you think the slowness is due to compilation time, right? So you think it is my second guess? Unfortunately, this is an action that will be run once, right after Julia starts, so I cannot amortize the compilation cost over subsequent uses. In any case, at the current rate the first deserialization will be measured in days, which is unacceptable.
Edit to add: `@btime` returns `24.913 s (45550300 allocations: 1.99 GiB)`, so it’s not compilation time after all.
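Measured along these lines (assuming `BenchmarkTools`, which runs the expression repeatedly and so excludes compilation):

```julia
using BenchmarkTools, Serialization
@btime deserialize("test.jls");
```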
Can you profile this with `Profile.@profile` and post the results of `Profile.print()` (via a pastebin)? Make sure to post only the results from a second run: after the first run, clear the profile buffer with `Profile.clear()` and re-run `Profile.@profile`, to eliminate compilation overhead from the measurement.
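Something like:

```julia
using Serialization, Profile

deserialize("test.jls")                    # warm-up run (pays compilation cost)
Profile.clear()                            # discard samples gathered so far
Profile.@profile deserialize("test.jls")   # profile the second, compiled run
Profile.print()                            # paste this output
```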
Aside: why is this slow by default? Because `serialize_any` and friends use `@nospecialize` annotations to prevent excessive compilation when doing ser/des. This makes payloads consisting of a few large objects very efficient, but it also means that many small objects are much more taxing, due to the dynamic-dispatch overhead. It’s a tradeoff that was likely chosen intentionally to ensure decent performance in the common case (since most uses of serialization probably involve very large but simple objects).
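A rough way to see this tradeoff (a sketch; exact timings will vary by machine):

```julia
using Serialization

# One large, simple array: written almost as raw bytes, very fast.
@time serialize(IOBuffer(), rand(Float64, 10^7))

# A million boxed values: each element takes a trip through the
# dynamically dispatched serializer.
@time serialize(IOBuffer(), Any[rand() for _ in 1:10^6])
```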
Thanks very much for the info! I already did a quick search on custom serialization and came up empty, so your pointer to an existing example is much appreciated.
I have a different analysis. The vast majority of time appears to be spent in the `serialize_cycle` routine, and while almost all of the time is spent in functions with dynamic dispatch, the overhead of those functions (i.e. the cost of the dynamic dispatch itself) appears to be low. We can see this because the bars above the red lines occupy almost the entire width of the red lines.
Further, a large portion of the overhead appears to be in evaluating `IdDict`’s `get`, which is used for cycle detection during serialization.
I suspect the issue is that you are serializing a data structure that could theoretically (although in your case doesn’t) contain pointer cycles, and `serialize` is spending 99% of its time checking for cycles.
The following addendum speeds things up 5x, but it is still substantially slower than serializing a vector.
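(The addendum itself isn’t reproduced here; the gist, sketched with hypothetical names and assuming the `Node` type above, is to mirror the mutable node with an immutable struct so the serializer does far less per-node cycle tracking:)

```julia
# Hypothetical sketch: `ImmutableNode` and `make_immutable` are illustrative.
struct ImmutableNode
    childvalues::Vector{Int8}
    children::Vector{ImmutableNode}
end

# One-time conversion pass after the tree is built.
make_immutable(n::Node) =
    ImmutableNode(n.childvalues, ImmutableNode[make_immutable(c) for c in n.children])
```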
Oh, brilliant idea just to have a single `make_immutable` pass after building the tree! This is a write-once, read-many situation, so that is a very easy improvement for me to make.
I spent a couple of hours last night working through Serialization.jl with an eye toward a custom dispatch for `serialize`. In the meantime, it was easy enough to flatten the tree into a vector of `Int`s representing all the `childvalues` instances and serialize that instead. Even with the overhead of flattening and expanding, I’ve already gotten a huge increase in throughput.
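Roughly what I mean by flattening and expanding, as a sketch (assuming the hypothetical `Node` above and a preorder layout with explicit counts; the real encoding may differ in detail):

```julia
# Encode each node in preorder as:
#   [n_childvalues, childvalues..., n_children, <each child recursively>...]
function flatten!(out::Vector{Int}, n::Node)
    push!(out, length(n.childvalues))
    append!(out, n.childvalues)
    push!(out, length(n.children))
    for c in n.children
        flatten!(out, c)
    end
    return out
end

# Inverse: rebuild a Node, returning it and the index just past what it used.
function expand(v::Vector{Int}, i::Int = 1)
    nv = v[i]; i += 1
    vals = Int8.(v[i:i+nv-1]); i += nv
    nc = v[i]; i += 1
    children = Node[]
    for _ in 1:nc
        child, i = expand(v, i)
        push!(children, child)
    end
    return Node(vals, children), i
end

# The flat Vector{Int} hits the serializer's fast raw-data path:
# serialize("flat.jls", flatten!(Int[], root))
```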
This bears a (not so?) surprising similarity to the transformation needed to preallocate vectors as a backing store for node-based containers like lists and trees.