In R, reading the .csv.gz file into a data frame via data.table's fread takes about 30 seconds.
In Julia, even reading (deserializing) the serialized 6.4GB file takes 330 seconds.
Previous experiments suggest that reading .csv.gz files into julia takes even longer, but I have not tried that here. (My naive attempt failed, probably a stupid error, but not worth chasing down at the moment.)
julia> crsp= CSV.read( GzipDecompressorStream( open("crspdaily-clean.csv.gz", "r") ), missingstrings=[ "NA", "NaN", "" ] )
ERROR: MethodError: no method matching position(::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream})
Closest candidates are:
position(::Base.SecretBuffer) at secretbuffer.jl:154
position(::Base.Filesystem.File) at filesystem.jl:225
position(::Base.Libc.FILE) at libc.jl:101
...
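A possible workaround (just a sketch, not tested on this file) would be to decompress the whole file into memory first, so that the parser gets a plain in-memory buffer, which does support position(), instead of the transcoding stream:

using CSV, CodecZlib, DataFrames
# read and decompress the .csv.gz into a byte vector
bytes = open(io -> read(GzipDecompressorStream(io)), "crspdaily-clean.csv.gz", "r")
# parse from the in-memory buffer instead of the transcoding stream
crsp = CSV.File(IOBuffer(bytes); missingstrings=["NA", "NaN", ""]) |> DataFrame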
So, I am wondering what I should recommend to my students. How can they store and retrieve this large data set quickly enough to want to switch over to Julia?
Having said that, R’s fread is a beast and there simply isn’t anything in julia currently that can deliver that level of performance. And I’m only talking about single-threaded reads; once fread starts to use threads, it simply leaves everything else in the dust…
After so many years of optimization, pandas is still much slower than data.table for many operations. In terms of data importing, fread is the king. However, you can retrieve the data from CRSP remotely with a Julia API to a PostgreSQL server and save the data in HDF5 or fst format; then reading should be faster. CSV is a row-based data format and inherently slow, and I guess fread is very close to the speed limit. If you want faster importing, use a column-based format.
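For what it’s worth, a column-by-column HDF5 round trip might look roughly like this (a sketch: the file name and column names are illustrative, df is some data frame, and it assumes plain columns without missing values):

using HDF5, DataFrames

# write each column as its own HDF5 dataset
h5open("crspdaily.h5", "w") do f
    for name in names(df)
        f[string(name)] = df[!, name]
    end
end

# read selected columns back into a fresh DataFrame
df2 = h5open("crspdaily.h5", "r") do f
    out = DataFrame()
    for name in ["permno", "date", "ret"]   # illustrative column names
        out[!, Symbol(name)] = read(f[name])
    end
    out
end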
I think there is a project wrapping fread’s C code for Python, but I think there have also been some issues around licensing etc. I didn’t follow the details, though. Presumably one could do the same from julia. I believe Apache Arrow is also building a new C-based CSV parser that one could presumably wrap eventually.
On the other hand, there is still a lot of room for improvement in TextParse.jl. When I did my PR on that package to speed things up, I mainly just wanted to fix the performance regressions relative to julia 0.6, and then I stopped. But I do think there is a fair bit of potential to go beyond that, still. And once we get PARTR in julia base, we should be able to add a multi-threaded version as well. I think TextParse.jl is really well suited for that, the internal design is super functional and side-effect free, which should make it relatively straightforward to parallelize the code. And if you look at the performance tests I posted, really it is only fread that is clearly faster, TextParse.jl seems to be very competitive with many other packages.
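For reference, TextParse.jl’s entry point is csvread (file name illustrative); it returns the parsed column vectors plus the column names rather than a DataFrame:

using TextParse
# returns a tuple of column vectors and a vector of column names
cols, colnames = csvread("crspdaily-clean.csv")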
Thanks, everyone. I am not asking for csv reading speed here. (PS: the multithreading made R’s fread about twice as fast.)
I need some format, any format, that reads big data fast. I think (de)serialization is the best we have at the moment for raw speed. (As I wrote, I can’t get JuliaDB to work. Maybe I should try SQLite? I presume this would be faster than Postgres?! Any other ideas?)
Just reading a 6GB file to /dev/null takes only about 1-2 seconds, so I wonder why the (de)serialization of a data frame takes so long. Presumably it should just write the column info followed by an fwrite of each column; without data conversion, it should be an order of magnitude faster than any CSV-based format.
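For reference, the (de)serialization round trip is something along these lines (a sketch using the stdlib Serialization; file name illustrative):

using Serialization
# write the in-memory data frame to disk...
open(io -> serialize(io, crsp), "crspdaily.jls", "w")
# ...and time reading it back
@time crsp2 = open(deserialize, "crspdaily.jls", "r")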
PS: In R:
> system.time( { crsp=iaw$read.csv.gz("crspdaily-clean.csv.gz") } )
[read from crspdaily-clean.csv.gz : 88915607 rows, 12 cols]
user system elapsed
57.29 14.78 26.63
>
I have had good luck with SASLib.jl reading big SAS files. I just read in a file that was 28 million rows and 24 columns (4.1GB) and it took just over 30 seconds, even though I was simultaneously running a resource-intensive Matlab program!
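A minimal SASLib read looks like this (file name illustrative); note that readsas returns a SASLib ResultSet rather than a DataFrame:

using SASLib
@time rs = readsas("crspdaily.sas7bdat")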
SQLite is not much faster, either (290 seconds). Tyler, did you read the SAS file into a DataFrame? I am now wondering whether the transfer of converted data into the DataFrame is the bottleneck.
Q: Is there a way to test
julia> @time r= (SQLite.Query(db, "SELECT * from crspdaily") |> DataFrame)
291.886328 seconds (2.89 G allocations: 77.278 GiB, 7.34% gc time)
without flowing the data into a DataFrame, but into /dev/null instead, to see whether the problem lies in the interface that flows the data into the DataFrame?
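(Perhaps something along these lines, iterating the query rows through the Tables.jl interface without ever building a DataFrame? Just a sketch, not tested against this database:)

using SQLite, Tables

@time begin
    n = 0
    for row in Tables.rows(SQLite.Query(db, "SELECT * from crspdaily"))
        n += 1   # touch each row, then discard it
    end
    n
end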
/iaw
PS: I would think that this is a critical Julia problem. It is not good that .csv.gz is slow, but if one cannot work around it, it seems near deadly for big data analysis.
To be clear, I don’t have a comparison handy at the moment. The 30-second read on my machine could be much slower on yours, although I doubt it would be that much slower. (I did just get a nice new NVMe SSD.)
In general I’ve seen good performance from SASLib and that makes me happy because it makes the transition between SAS and Julia much easier.
(It would also be pretty annoying for julia to have to resort to the ancient SAS format as its main storage format… but I guess it is better than no fast storage format.)
It’s quite unexpected that deserialization is slower than CSV… Can you file an issue against Julia with a minimal reproducer? Ideally you’d provide a command to generate a data frame with random values. Or even better, if you can reproduce the problem without DataFrames, just serializing/deserializing the column vectors directly.
Hi Milan, I am happy to help, but this file is pretty large and experimentation is slow. Plus, the data is proprietary, so it is difficult to share. If someone from Julia Computing wants to try it on my machine (it sits on the internet), I can arrange access. Regards, /iaw
The timing from Feather.read is a bit misleading, though, because it doesn’t actually read any of the data from disc. All it does is read the metadata (i.e. which columns are in the file and what types they have), and then it creates a DataFrame that loads things from disc on demand when you actually access the values. So you only pay the price of loading from disc once you start to access the data (e.g. by indexing into the DataFrame). Of course, at that point these kinds of data access operations will be much, much slower than on a normal DataFrame (because each one includes a trip to the disc). That can all be nice if one knows about it, but it means you want to write very different algorithms and use special access patterns when you have one of these special DataFrames. I also suspect that manipulating such a DataFrame won’t work properly.
I think in theory Feather.materialize should give you a normal DataFrame, i.e. load everything from disc into main memory as a DataFrame. But at least on my system that does not seem to work right now. In any case, an apples-to-apples comparison of read performance would use Feather.materialize, not Feather.read.
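In other words, the comparison would look something like this (file name illustrative, and as said, materialize currently seems broken on my system):

using Feather

@time lazy = Feather.read("crspdaily.feather")        # reads only the metadata, near-instant
@time full = Feather.materialize("crspdaily.feather") # actually loads all the data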
FeatherFiles.jl is another option. You would read a file via:
using FeatherFiles, DataFrames
df = load("foo.feather") |> DataFrame
So it is the same API pattern you use for CSVFiles.jl. It reads everything fully from disc, i.e. gives you a normal DataFrame.
The same caveat about overhead that I made with respect to CSVFiles.jl applies here as well: if you use the master branch of DataFrames.jl you’ll get much better performance than with the released version of DataFrames.jl.
You can find comprehensive documentation of the file IO story in Queryverse.jl over here.
What I meant is that we probably don’t care about the data: any set of columns of the same types containing random values should give similar timings. So you can write a few lines of code to create such a data frame, and if it reproduces the problem, that will make it much easier to debug. If it doesn’t reproduce the problem, then that will be an interesting piece of information in itself (meaning the particular values matter).
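For example, something like this (sizes and column types are illustrative, just scaled down from the real file):

using DataFrames, Serialization

n = 10_000_000                      # scale toward ~90M rows as memory allows
df = DataFrame(
    id   = rand(1:100_000, n),
    date = rand(1:25_000, n),
    x    = rand(n),
    y    = rand(Float32, n),
)

open(io -> serialize(io, df), "repro.jls", "w")
@time open(deserialize, "repro.jls", "r")

# and, to take DataFrames out of the picture, the bare columns:
open(io -> serialize(io, (df.id, df.date, df.x, df.y)), "cols.jls", "w")
@time open(deserialize, "cols.jls", "r")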