Reading Data Is Still Too Slow
I spent the morning figuring out how serialization and deserialization perform. (I was writing more complex programs, but it ultimately boiled down to the following.)
I don’t think I have run into an “unusual” slowness problem. There is a more basic problem with serialization: a lack of special-case heuristics.
On a Mac Pro (3.2 GHz Xeon W, 64 GB RAM), it takes about 10 seconds to deserialize an Int64 vector of 100 million elements. On disk, this vector takes less than 1 GB. (An equivalent Float64 vector takes about 30 seconds.)
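A minimal reproduction sketch of the benchmark, using the stdlib Serialization module and a scratch file (the file name is arbitrary; scale `n` to 10^8 to match the timings reported above):

```julia
using Serialization

n = 1_000_000                 # use 100_000_000 to reproduce the reported ~10 s
v = rand(Int64, n)

open("v.jls", "w") do io
    serialize(io, v)          # write with the stdlib serializer
end

@time w = deserialize("v.jls")
@assert w == v
```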
Let me put this in perspective. Reading the raw Int64 bytes from disk takes less than 0.01 seconds, so if the vector were stored as plain binary, Julia would essentially be done after the read. For a silly comparison, R’s fread takes 5 seconds of wall-clock time to read 12x as many columns in CSV format, converting them, putting them into a data frame, etc., albeit using many cores. We are deep into god-awful-performance territory for what may well be the most common use case for large data sets.
So, my suggestion is to add more special-case intelligence: long vectors of Float32, Float64, Int32, and Int64 (perhaps also with missing) should be dumped and restored as a raw binary stream. This should yield a deserialization speedup of one to two orders of magnitude. [In my case, instead of 300 seconds, my deserialize would take 3 to 30 seconds.]
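The proposed fast path can be sketched as follows. The helper names (`save_vec`, `load_vec`) are hypothetical, not an existing API; the idea is simply a length header followed by the raw payload, restored with one bulk `read!` instead of per-element dispatch:

```julia
# Hypothetical fast path for long Int64 vectors: header + raw bytes.
function save_vec(path::AbstractString, v::Vector{Int64})
    open(path, "w") do io
        write(io, Int64(length(v)))   # header: element count
        write(io, v)                  # raw native-endian payload
    end
end

function load_vec(path::AbstractString)
    open(path, "r") do io
        n = read(io, Int64)
        read!(io, Vector{Int64}(undef, n))  # one bulk read into place
    end
end
```

The same pattern extends to Float32/Float64/Int32 by tagging the element type in the header; vectors with missing would additionally need a validity mask.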
I hope this helps.