Reading Data Is Still Too Slow


#1

My crsp daily stock return data is a 6.4GB csv file. Compressed into csv.gz, it is about 1.7GB. (I posted a sample with my juliadb post, JuliaDB Getting Started...with save error

Local macos iMac Pro 2018. Fast SSD.

  • On R, reading the .csv.gz file into a data frame via datatables fread takes about 30 seconds.

  • On Julia, even reading (deserializing) the serialized 6.4GB file takes 330 seconds.

Previous experiments suggest that reading .csv.gz files into julia takes even longer, but I have not done this here. (My naive attempt failed, probably a stupid error, but not worth chasing down at this moment.)

julia> crsp= CSV.read( GzipDecompressorStream( open("crspdaily-clean.csv.gz", "r") ), missingstrings=[ "NA", "NaN", "" ] )
ERROR: MethodError: no method matching position(::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream})
Closest candidates are:
  position(::Base.SecretBuffer) at secretbuffer.jl:154
  position(::Base.Filesystem.File) at filesystem.jl:225
  position(::Base.Libc.FILE) at libc.jl:101
  ...

So, I am wondering what I should recommend to my students. How can they store and retrieve this large data set quickly enough to want to switch over to Julia?


De-Serialization Performance
#2

Could you try CSVFiles.jl on your dataset?

using DataFrames, CSVFiles, FileIO

crsp = load(File(format"CSV", "crspdaily-clean.csv.gz"), nastrings=[ "NA", "NaN", "" ]) |> DataFrame

# or for already uncompressed file
crsp = load("crspdaily-clean.csv", nastrings=[ "NA", "NaN", "" ]) |> DataFrame

This indicates that TextParse.jl is often faster (or en par) than CSV.jl. CSVFiles.jl uses TextParse.jl under the hood. The benchmarks I posted there showed an additional overhead when using CSVFiles.jl over pure TextParse.jl, but you can get rid of that entirely when using the latest master version of DataFrames.jl.

Having said that, R’s fread is a beast and there simply isn’t anything in julia currently that can deliver that level of performance. And I’m only talking about single threaded reads, once freads starts to use threads, it simply leaves everything else in the dust…


Problem using DataFrames master
#3

I’m not familiar with these things, but could you feasibly use a wrapper to call the underlying tool from Julia?


#4

After so many years optimization, pandas is still much slower than data.table in many functions. In terms of data importing, fread is the king. However, you can retrieve the data from crsp remotely with Julia API to postgresql server and save the data in hdf5 or fst format, then the reading could be faster. CSV is a row based data format and inherently slow, and I guess fread should be very close to the speed limit. If you want faster importing, use column based format.


#5

I think there is a project that is wrapping fread’s C code for python, but I think there also have been some issues around licensing etc. I didn’t follow the details, though. Presumably one could do the same from julia. I believe Apache Arrow is also building a new C based CSV parser that one could presumably wrap eventually.

On the other hand, there is still a lot of room for improvement in TextParse.jl. When I did my PR on that package to speed things up, I mainly just wanted to fix the performance regressions relative to julia 0.6, and then I stopped. But I do think there is a fair bit of potential to go beyond that, still. And once we get PARTR in julia base, we should be able to add a multi-threaded version as well. I think TextParse.jl is really well suited for that, the internal design is super functional and side-effect free, which should make it relatively straightforward to parallelize the code. And if you look at the performance tests I posted, really it is only fread that is clearly faster, TextParse.jl seems to be very competitive with many other packages.


#6

Thanks, everyone. I am not asking for csv reading speed here. (PS: the multithreading made R’s fread about twice as fast.)

I need some format— any format —that reads big data fast. I think (de)serialization is the best we have at the moment for raw speed. (as I wrote, I can’t get JuliaDB to work. maybe I should try sqlite? I presume that this would be faster than postgres?! anyother ideas?)

David—

julia> using DataFrames, CSVFiles, FileIO

julia> @time crsp = load(File(format"CSV", "crspdaily-clean.csv.gz"), nastrings=[ "NA", "NaN", "" ]) |> DataFrame
409.564670 seconds (4.75 G allocations: 180.736 GiB, 38.10% gc time)
88915607×12 DataFrame
...

400 seconds does not seem so bad, given that the “near-native” format takes >300 seconds.

julia> @time crsp=deserialize( open("crspdaily-clean.jls", "r") )
302.218386 seconds (4.03 G allocations: 85.443 GiB, 3.47% gc time)
88915607×12 DataFrame

The pure reading for a 6GB file to /dev/null is only about 1-2 seconds. I wonder why the serialization of a data frame takes so long. Presumably, it should write column info followed by an fwrite of the column. Without data conversion, it should be an order of magnitude faster than any csv based format.

PS: In R:

> system.time( { crsp=iaw$read.csv.gz("crspdaily-clean.csv.gz") } )
[read from crspdaily-clean.csv.gz : 88915607 rows,  12 cols]
   user  system elapsed
  57.29   14.78   26.63
>

#7

I have good luck with SASLib.jl reading big SAS files. I just read in a file that was 28 million rows and 24 columns (4.1GB size) and it took just over 30 seconds … Even though I was simultaneously running a resource intensive Matlab program!

So I’d suggest you try that for your CRSP file.


#8

SQLite is not much faster, either (290 seconds). Tyler—did you read the SAS file into a DataFrame? See, I am now wondering if the transmit of converted data into the DataFrame is the bottleneck.

Q: Is there a way to test

julia> @time r= (SQLite.Query(db, "SELECT * from crspdaily") |> DataFrame)
291.886328 seconds (2.89 G allocations: 77.278 GiB, 7.34% gc time)

without flowing the data into a DataFrame, but into /dev/null instead to see if the problem sits with the flow-in data interface to DataFrame?

/iaw

PS: I would think that this is a critical Julia problem. It is not good that .csv.gz is slow, but if one cannot work around it, it seems near deadly for big data analysis.


#9

Conversion to DataFrame from the object that readsas returns is very fast and efficient. Less than half a second for that particular table.


#10

To be clear, I don’t have a comparison handy at the moment. The 30 seconds on my machine could be much slower on yours, although I doubt it’d be that much slower. (I did just get a nice new NVMe SSD :grin:).

In general I’ve seen good performance from SASLib and that makes me happy because it makes the transition between SAS and Julia much easier.


#11

Isn’t SASlib.jl a read-only package?

(It would also be pretty annoying for julia to have to resort to ancient sas for its main format…but I guess it is better than no fast storage format.)


#12

If loading speed is a concern, CSV is probably suboptimal, Feather should be much faster to load, have you tried using it?


#13

yes, csv is suboptimal. I had hoped that serialization would have been, given some past tests.

alas, piever you are completely correct. Feather is the way to go:

julia> @time x=Feather.read("crspdaily-clean.feather")
  1.500487 seconds (4.97 M allocations: 251.940 MiB, 5.56% gc time)
88915607×12 DataFrame. Omitted printing of 1 columns

I will change my recommendation in http://julia.cookbook.tips/doku.php?id=fileformats . So, this is great news.

(PS: I will mark this post as the answer, although piever provided it, because it has the timing numbers.)


#14

My research back then still seem somewhat relevant Benchmarking ways to write/load DataFrames IndexedTables to disk


#15

It’s quite unexpected that deserialization is slower than CSV… Can you file an issue against Julia with a minimal reproducer? Ideally you’d provide a command to generate a data frame with random values. Or even better, if you can reproduce the problem without DataFrames, just serializing/deserializing the column vectors directly.


#16

hi milan—I am happy to help, but this file is pretty large and experimentation is slow. Plus, the data is proprietary, so it is difficult to convey. If someone from julia computing wants to try it on my machine (it sits on the internet), I can arrange access. regards, /iaw


#17

The timing from Feather.read is a bit misleading, though, because it actually doesn’t read any of the data from disc. All it does is read the metadata from disc (i.e. what columns are there in the file, what types do they have), and then it creates a DataFrame that will load things from disc on-demand when you actually access the values. So you will only pay the price of loading stuff from disc once you start to access the data (e.g. by indexing into the DataFrame). Of course, at that point these kind of data access operations on DataFrame will be much, much slower than with a normal DataFrame (because they actually include a trip to the disc). That all can be nice if one knows about it, but it of course means that you want to write very different algorithms and use special access patterns when you have one of these special DataFrames. I also think manipulating such a DataFrame probably also won’t work properly.

I think in theory Feather.materialize should give you a normal DataFrame, i.e. load everything from disc and load stuff into main memory, in a DataFrame. But at least on my system that seems not to work right now. But in any case, an apples-to-apples comparison of read performance would use Feather.materialize, not Feather.read.

FeatherFiles.jl is another option. You would read a file via:

using FeatherFiles, DataFrames

df = load("foo.feather") |> DataFrame

So the same API pattern you use for CSVFiles.jl. It will read everything fully from disc, i.e. give you a normal DataFrame.

The same caveat about overhead that I made with respect to CSVFiles.jl applies here as well: if you use the master branch of DataFrames.jl you’ll get much better performance than with the released version of DataFrames.jl.

You can get a comprehensive documentation about the file IO story in Queryverse.jl over here.


#18

@iwelch maybe you can also try this? 6G isn’t too big and I am assuming that you are reading everything into memory.


#19

@tk3369 We should add CSVReader.jl to https://github.com/davidanthoff/csv-comparison! Any chance you might register it in METADATA.jl, then I could add it into the mix for the comparison.


#20

What I meant is that we probably don’t care about the data, any set of columns of the same type containing random data should give similar timings. So you can write a few lines of code to create such a data frame, and if it reproduces the problem that will make it much easier to debug. If it doesn’t reproduce the problem, then that will be an interesting piece of information (meaning particular values matter).