De-Serialization Performance


#1

Related topic: Reading Data Is Still Too Slow.

I spent the morning figuring out how serialization/deserialization perform. (I was writing more complex programs, but ultimately the problem whittled down to the following.)

I don’t think I have run into an “unusual” slowness problem. There is a more basic problem with serialization: Lack of Heuristics.

On a Mac Pro (3.2 GHz Xeon W, 64 GB RAM), it takes about 10 seconds to deserialize an Int64 vector of 100 million elements. On disk, this vector takes less than 1 GB. (It takes about 30 seconds for an equivalent Float64 vector.)

Let me put this in perspective. The raw Int64 input from disk takes less than 0.01 seconds. If the Int64 vector data were stored as raw binary, Julia would basically be done. For a silly comparison, R’s fread takes 5 seconds of wall-clock time to read 12x as many columns in CSV format, converting them, putting them into a dataframe, etc., albeit using many cores. We are very far into “gd-awful performance” for what may well be the most common use cases for large data sets.

So, my suggestion is to use more special-case intelligence. Long Vectors of Float32, Float64, Int32, and Int64 (perhaps also with missing) should be dumped/restored as a binary stream. This should yield a deserialization speedup of 1-2 orders of magnitude. [In my case, instead of 300 seconds, my deserialize would be 3 to 30 seconds.]
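The binary-dump idea can be sketched in a few lines with `write` and `read!` (a minimal sketch, not a proposal for the actual serializer internals; the function names and file path are made up for illustration):

```julia
# Dump a plain Int64 vector as a raw binary stream:
# an 8-byte length header followed by the contiguous element data.
function dump_raw(path::AbstractString, v::Vector{Int64})
    open(path, "w") do io
        write(io, Int64(length(v)))  # length header
        write(io, v)                 # contiguous 8-byte elements
    end
end

# Restore with a single allocation and one bulk read.
function restore_raw(path::AbstractString)
    open(path) do io
        n = read(io, Int64)
        v = Vector{Int64}(undef, n)  # one big allocation up front
        read!(io, v)                 # bulk read, fread-style
        v
    end
end

x = rand(Int64, 1_000_000)
dump_raw("/tmp/raw.bin", x)
@assert restore_raw("/tmp/raw.bin") == x
```

The restore path is essentially a C `fread()` into a preallocated buffer, which is the performance target the per-element fallback should be compared against.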

I hope this helps.

/iaw


#2

I’ve just tried locally, and it takes about half a second to serialize or deserialize such a vector. Am I missing something? That’s a two-year-old laptop with an SSD and a Skylake i5 CPU under Linux.

julia> x = rand(Int, 100_000_000);

julia> @time serialize(open("test.jls", "w"), x)
  0.476476 seconds (19 allocations: 1.781 KiB)
800000000

julia> @time y = deserialize(open("test.jls"));
  0.478984 seconds (19 allocations: 762.941 MiB, 20.03% gc time)

AFAIK that’s already how it works. That’s the point of serialization.


#3

Sorry, I stand corrected. What messes it up is the presence of Missing.

julia> using DataFrames, Serialization

julia> @time d=deserialize( open( "permno.jls") )
  8.719186 seconds (356.19 M allocations: 7.896 GiB, 7.77% gc time)
88915607-element Array{Union{Missing, Int64},1}:
 10000

julia> @time( deserialize("nomissing.jls"))
  0.767161 seconds (1.51 M allocations: 755.579 MiB, 5.54% gc time)
88915607-element Array{Int64,1}:
 10000
...

These are the same vectors. (Even 0.76 seconds seems slow, of course, but it is already reasonable.)

I think serialize() needs a little more intelligence, then, about missing. It has become such a core part of the data-related aspects of the language.

As to internals, why would a deserialization need millions of allocations? Can’t it store the vector length up front and do one giant allocation? The start of the vector would also be a good place to store a special bit pattern designating missing values; on serialization it would just need to be checked for absence among the actual values.

/iaw


#4

Here is the test to copy-paste:

julia> using Serialization

julia> x = rand(Int, 100_000_000);

julia> xm = convert(Vector{Union{Int,Missing}}, x); xm[1:1000:end] .= missing;

julia> @time serialize(open("/tmp/test.jls", "w"), x)
  0.376715 seconds (119.99 k allocations: 5.847 MiB)
800000000

julia> @time y = deserialize(open("/tmp/test.jls"));
  0.570197 seconds (7.28 k allocations: 763.273 MiB, 40.02% gc time)

julia> @time serialize(open("/tmp/test.jls", "w"), xm)
  3.338387 seconds (100.00 M allocations: 1.990 GiB, 9.15% gc time)

julia> @time y = deserialize(open("/tmp/test.jls"));
 56.912022 seconds (399.90 M allocations: 8.787 GiB, 0.97% gc time)

In my case it is reading/writing to a RAM disk, which should be about as fast as RAM. The file sizes are similar at 763 MB and 858 MB. Timings are similar on 1.0.2 and master. Looks issue-worthy to me.


#5

OK. The serialization code probably needs to use the same tricks as the in-memory storage (i.e. store one vector for values, and one for the type tag). Worth filing an issue indeed. Cc: @quinnj
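The in-memory trick mentioned above could look roughly like this for a `Vector{Union{Int64,Missing}}` (a sketch under the assumption that values and type tags are split into two flat arrays; `split_missing`/`join_missing` are made-up names, not serializer internals):

```julia
# Split a Union{Int64,Missing} vector into two dense arrays:
# a byte mask marking which slots are missing, and the values.
function split_missing(v::Vector{Union{Int64,Missing}})
    mask = Vector{UInt8}(undef, length(v))
    vals = Vector{Int64}(undef, length(v))
    for i in eachindex(v)
        if v[i] === missing
            mask[i] = 0x01
            vals[i] = 0          # placeholder for a missing slot
        else
            mask[i] = 0x00
            vals[i] = v[i]
        end
    end
    mask, vals
end

# Rebuild the Union vector from the mask and values arrays.
function join_missing(mask::Vector{UInt8}, vals::Vector{Int64})
    v = Vector{Union{Int64,Missing}}(undef, length(vals))
    for i in eachindex(vals)
        v[i] = mask[i] == 0x01 ? missing : vals[i]
    end
    v
end

xm = Union{Int64,Missing}[1, missing, 3]
mask, vals = split_missing(xm)
@assert isequal(join_missing(mask, vals), xm)
```

Both flat arrays could then be written as raw binary streams, so neither serialization nor deserialization would need per-element boxing.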


#6

can I please leave it to you to file an issue? it will be smarter than what I can file.

Alas, before this thread is over, can I also ask why even the non-missing vector needs to go through so many allocations; and presumably why even ordinary vectors take an order of magnitude longer than the direct input time? The vector length should be known in advance, and the speed should be roughly the equivalent of a C fread(), plus a little bit of overhead.


#7

I’ve filed this:

No idea. But what do you mean by “the direct input time”? In general, please post reproducible examples, especially since it’s very easy in this case. We would have immediately spotted that it was related to missing values if you had done so directly in the OP.


#8

I would post it if I really knew what I was talking about ;-).

I just figured out why I was confused. For the record, in Mauro3’s example:

julia> @time y = deserialize(open("/tmp/test.jls"));
 0.570197 seconds (7.28 k allocations: 763.273 MiB, 40.02% gc time)

So, I am staring at this, and I am thinking that there are 7.28k allocations. I wondered “If the vector saved its length as a first argument, couldn’t this become 1 allocation and a lot less time spent on gc?”

Alas, the number of allocations isn’t necessarily the serialization. On my system, just repeating this three times

julia> @time y = deserialize(open("/tmp/test.jls"));
  0.343549 seconds (30.10 k allocations: 764.407 MiB, 3.46% gc time)

julia> @time y = deserialize(open("/tmp/test.jls"));
  0.401922 seconds (26 allocations: 762.942 MiB, 21.55% gc time)

julia> @time y = deserialize(open("/tmp/test.jls"));
  0.354141 seconds (19 allocations: 762.941 MiB, 11.41% gc time)

the number of allocations varies from 30,100 down to 19. That’s quite a range! Why so much variation, and what does it mean? Alas, I misinterpreted this: it appears that the GC time is not really related to the number of allocations, and it is the GC time, not the allocation count, that produces the variation in time to completion.

in any case, I now understand that Julia is within striking distance of a pure disk-to-memory transfer in speed on plain vectors. So, there is nothing to improve there.

this was a false alarm on my end. sorry.

I am really looking forward to a serialization/deserialization with the new DataFrames and good handling of missing!


#9

I think you misunderstand — you should post an MWE especially if you don’t fully know what you are doing. Not doing so just makes it much more difficult for people who want to help you.

Probably because the first time you call the function with these types, it needs to compile. The compiler itself allocates.
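The compilation effect is easy to see by timing the same call twice (a minimal sketch; `sum_to` is just a made-up function for illustration):

```julia
# A trivial function: the first @time includes JIT compilation
# (and its allocations); the second shows steady-state cost.
function sum_to(n)
    s = 0
    for i in 1:n
        s += i
    end
    s
end

@time sum_to(10^7)   # first call: compilation time and allocations included
@time sum_to(10^7)   # second call: the timing to actually trust
@assert sum_to(10) == 55
```

This is why the allocation count in a first-run `@time` can be wildly higher than on subsequent runs of the very same expression.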


#10

It would be a good idea for you to read through https://docs.julialang.org/en/v1/manual/performance-tips/index.html

For example:

On the first call ( @time sum_global() ) the function gets compiled. (If you’ve not yet used @time in this session, it will also compile functions needed for timing.) You should not take the results of this run seriously.


#11

Speaking of wasted time, it seems that the OP spent a morning tracking down this issue.

Edit: Thanks Ivo for help to make Julia better! :slight_smile:


#12

thanks for the acknowledgement. yes, I did spend quite a bit of time figuring out why this was so bad.

for end-user beta testers like me, assuming/if you want my input:

I have read the performance tips twice, but my memory is not that good. Ideally, one would not have to remember all the docs and FAQs and tips, of course; the obvious approach should just work, and when it does not, the compiler should emit warning messages when one runs into these pitfalls (unless warnings are turned off). A language with many gotchas is not very user friendly.

of course, Julia is not yet a mature-ecosystem language. When one runs into a problem (e.g., my JuliaDB save failure), there is as good a chance that one has hit a bug (or documentation error) as that one has programmed it wrong. If I knew it was the latter, i.e., that it was my fault, I would spend the day hunting down the relevant docs. Right now, with my time precious too, I only spend about half an hour to an hour hunting for solutions before asking here.

I am not complaining. I know that I am a guinea pig. But it’s not as obvious as RTFM. Heck, I am beta-testing Julia for masochistic “fun”; right now, I don’t even have a clear use case yet. If I am taking up too much of the friendly folks’ time here, then I can wait until Julia is more stable and let others hunt for the problems. I will be happy to come back in a year or two.

/iaw


#13

I for one just appreciate any feedback, in whatever form it comes. So, thanks!