Benchmarking ways to write/load DataFrames and IndexedTables to disk

Thanks, this stuff is super helpful!

Here are some conclusions from this re CSV files:

  • we are in really good shape in terms of CSV reading performance. @shashi’s TextParse.jl is simply awesome. If I read this correctly, it beats both the Python/pandas story (narrowly) and the R/data.table story (pretty clearly). There is another twist to this: Right now CSVFiles.jl introduces an entirely unnecessary overhead into the story. I’m super close to getting rid of that. But even with that overhead we are faster than the Python and R story. But, that should get even better going forward. One interesting test to add would be a pure TextParse.jl story that skips the CSVFiles/FileIO integration. Once I fix the thing in CSVFiles.jl we should essentially see that performance from CSVFiles.jl going forward. The test to add would just be csvread("filename.csv").
  • For CSV writing, CSVFiles.jl right now takes the second spot. It beats the Python/pandas story (clearly), but not the R/data.table story. We of course know why that is: data.table’s CSV writing is multi-threaded, ours is not. I would love to see another benchmark added to this pile: fwrite with nThread=1. That would tell us whether fwrite is faster because of multi-threading or whether it has other optimizations as well.
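
For anyone who wants to reproduce this kind of read/write timing on a small scale, here is a minimal sketch in Python using only the standard library. It is not fwrite, pandas or CSVFiles.jl, just a single-threaded baseline. The helper names (best_of, make_rows, write_csv, read_csv) and the synthetic three-column table are made up for illustration.

```python
import csv
import os
import tempfile
import time


def best_of(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def make_rows(n):
    """Build n rows of mixed integer, float and string columns (synthetic data)."""
    return [(i, i * 0.5, f"id{i % 100}") for i in range(n)]


def write_csv(path, rows):
    """Single-threaded CSV writer using only the standard library."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["a", "b", "c"])
        writer.writerows(rows)


def read_csv(path):
    """Read the file back fully, so the parse cost is actually paid."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


rows = make_rows(10_000)
path = os.path.join(tempfile.mkdtemp(), "bench.csv")
write_seconds = best_of(lambda: write_csv(path, rows))
read_seconds = best_of(lambda: read_csv(path))
print(f"write: {write_seconds:.4f}s  read: {read_seconds:.4f}s")
```

Taking the minimum over a few repeats reduces noise from disk caching and other processes. The same harness could wrap a multi-threaded and a single-threaded call to the same writer to separate the two effects.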

Oh, one more question: which version of DataFrames.jl is used for these comparisons? v0.10 or v0.11?

Thank you so much for your suggestions. I will try to add more if possible.
In fact, I am just a researcher in urban transportation and am not familiar with multi-threaded or multi-core programming. If the default package can use multiple cores automatically, that is very useful for users like me, who just use Julia like Matlab, as a tool for research, engineering or teaching. Here I am only showing the information on which I based my choice of package for a research project.
By the way, thanks for your great work on julia-vscode. I like it very much!

The version of DataFrames.jl is v0.11.5

Totally agreed, and I’m not suggesting to replace the normal fwrite test with one that uses only one core. The “fair” comparison is to give each package its best shot. I’d just like to see the single-core performance of fwrite as well, for my own curiosity. If that is comparable to what we have with CSVFiles.jl, it would suggest that we are doing as well as we can right now and just have to wait until we get better threading in Julia to improve the situation. But if single-core fwrite beats us in a significant way, then we should dig in and try to see whether we can improve the performance of CSVFiles.jl today.

What compression options are being used by the different packages?

fst is using default compression parameters. I don’t think Feather or JLD have compression options. This reminds me: I should add R’s rds files to the mix, with compression on and off.
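
To illustrate why the compression setting matters for these comparisons, here is a small sketch in Python. It uses zlib from the standard library purely as a stand-in; it is not the codec used by fst, rds or any of the packages benchmarked here, and the payload is synthetic.

```python
import time
import zlib

# Synthetic, fairly repetitive payload standing in for a serialized table.
payload = b"".join(
    f"{i % 1000},{i % 7},id{i % 100}\n".encode() for i in range(50_000)
)

results = {}
for level in (0, 1, 6, 9):  # 0 = store without compressing, 9 = maximum effort
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    results[level] = (len(compressed), elapsed)
    print(f"level {level}: {len(compressed)} bytes in {elapsed:.4f}s")
```

A writer that compresses does more CPU work but writes fewer bytes, so a benchmark that mixes compressed and uncompressed formats is measuring two different things unless the settings are reported.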

@davidanthoff Totally agree! Some other packages, such as Dask for Python (see the Dask documentation), can even use multiple cores automatically in a cluster environment. The comparison results are only useful for package developers if they are obtained with the same computing resources.
I saw there is a data benchmark project by @xiaodai, which would be very useful (https://github.com/xiaodaigh/DataBench.jl). Currently, I mainly use Python for my daily work and am only a beginner in Julia. Maybe @xiaodai could give more useful results to help the development of the Julia packages.

Dask, JuliaDB.jl, SAS, and Spark on a single machine (I know most of them can be distributed) belong to the medium-data tool set. They are not just in-memory tools like pandas and data.table. I think they should get their own benchmarks, but they can also be benchmarked in this suite. Spark uses Parquet and HDFS, so at some point we should add those to the benchmarks.

If you are not limiting your study to CSV, JuliaDB can also save in binary format: you could benchmark JuliaDB.save and JuliaDB.load for completeness.

To clarify here, what good benchmarks can really do is find out exactly what assumptions the algorithms are making and how that affects their performance. It can be very helpful to us to know not only that there is a difference, but what we would need to do in order to overcome this difference. Single-core to multithreaded comparisons might tell the user what’s going on, but they are not helpful to us without the single-core comparison (are we doing something fundamentally different? Or do we just need to multithread?). Even as a beginner in Julia, you can always help by benchmarking to find cases where algorithms are doing well and where they are not, and what that means about the implementation. Anyways, if you do this, you’ll find yourself knowing more about the implementations than anyone else quite quickly.

Let me just second what @ChrisRackauckas wrote. These kinds of benchmarks really, really help me prioritize things in the universe of packages I maintain. My main constraint is time, so I’m just incredibly grateful to folks who create and run these benchmarks. It is a fantastic way to help the creators of packages with their work!

@davidanthoff I would like to add more tests according to the advice from you and the community. It would also be a good way to learn the Julia language.

Updated with @zhangliye’s code. R’s feather implementation is quite a bit faster than Julia’s. This can probably be improved. data.table’s fwrite is actually very, very fast and is competitive with fst.

@davidanthoff Looks like CSV.jl has a reasonably fast reader, on par with pandas and data.table, in this case, which is reading in 1m rows with 9 columns of mixed string, float and integer types. I am interested in testing this on a largish real-world dataset, e.g. Fannie Mae, to see how it stacks up; last time I tried it, it didn’t compare so favourably.

Hello all, as I noted on the other thread, I am in the process of completely rewriting Feather.jl to use my new Arrow.jl, which is a back-end for serializing and deserializing any Apache Arrow formatted data (which Feather is). I don’t actually expect it to be any faster than the old version of Feather (the reason for rewriting it was more about cleaning up code and expanding functionality), but I do expect it to be very performant, particularly for non-nullable bits types. I expect things to change significantly in 0.7, as Arrow relies strongly on reinterpret, the behavior of which has been drastically altered in 0.7. Once there are some 0.7 release candidates I will spend some time trying to optimize Feather.jl, if I have the time. At that point I’ll try to post it as a benchmark with the others.

I’ve done some performance testing on the new Feather.jl (my fork, still a PR). I read in a 20 million row data frame of mostly “worst case scenario” data (strings that may be missing) that was about 5.6 GB. Took the new feather about 19 seconds. Took about 14 seconds for python feather (keep in mind that’s pretty much all C++). A big chunk of the time it took Julia was because of the inefficiency of Union types in 0.6, so hopefully we’ll be able to beat python easily in 0.7. I haven’t thoroughly tested the “best case scenario” (non-missings and bits types) yet, but it is pretty much as fast as it could possibly be.

The new Feather will also take advantage of random access capability (memory mapping with lazy loading), so most of the time it will not be necessary for you to actually load in all the data.
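
For readers unfamiliar with the idea, here is a rough sketch in Python of what memory mapping with lazy loading buys you. This is not how Feather.jl or Arrow.jl is implemented; it only assumes a made-up file holding a single column of fixed-width float64 values, and the names (column.bin, read_value) are hypothetical.

```python
import mmap
import os
import struct
import tempfile

N = 100_000
path = os.path.join(tempfile.mkdtemp(), "column.bin")

# Write one column of little-endian float64 values: value i is i * 0.5.
with open(path, "wb") as f:
    for i in range(N):
        f.write(struct.pack("<d", i * 0.5))


def read_value(mm, index):
    """Decode the float64 at `index` straight from the mapped bytes."""
    return struct.unpack_from("<d", mm, index * 8)[0]


with open(path, "rb") as f:
    # Mapping the file does not read it up front; the OS pages data in on access.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = read_value(mm, 0)
        middle = read_value(mm, N // 2)
        last = read_value(mm, N - 1)

print(first, middle, last)
```

Because every value has the same width, element i lives at byte offset 8 * i, so a single value can be fetched without touching the rest of the file. Variable-length data such as strings needs an extra offsets array to get the same random access.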

So, there’s light at the end of the tunnel for deserialization!

By the way, in the benchmarks you show above, are you absolutely sure that R is actually deserializing all the data, and not using some sort of memory mapping or lazy loading scheme?
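
To show why this matters for the timings, here is a toy sketch in Python. LazyColumn is a hypothetical class invented for this example, not anything from R, feather or fst. It just shows that a "load" call can return almost instantly if the real parsing is deferred, so a fair benchmark has to touch the data.

```python
import time


class LazyColumn:
    """Hypothetical column that defers parsing until values are requested."""

    def __init__(self, raw_lines):
        self._raw = raw_lines
        self._values = None

    def materialize(self):
        """Parse the raw text once, on first use, and cache the result."""
        if self._values is None:
            self._values = [float(x) for x in self._raw]
        return self._values


raw = [str(i * 0.25) for i in range(200_000)]

start = time.perf_counter()
col = LazyColumn(raw)  # "load" returns immediately, nothing is parsed yet
lazy_seconds = time.perf_counter() - start

start = time.perf_counter()
total = sum(col.materialize())  # touching every value forces the real work
forced_seconds = time.perf_counter() - start

print(f"lazy load: {lazy_seconds:.6f}s  forced: {forced_seconds:.6f}s  sum: {total}")
```

Computing something cheap over every column (a sum, a length, a checksum) after loading is one simple way to make sure each package has actually deserialized everything before the clock stops.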

Just looked at your awesome Arrow.jl package. Wonder how it’s doing in v1?

As of the last couple of releases, Arrow.jl only works on Julia 1.0, so no worries there.

Note that this package was built primarily with Feather in mind, so we are still missing all the arrow IPC stuff (it’s really just arrays at the moment). As far as I know, at the moment Feather.jl is the fastest way to read and write tabular data from disk in pure Julia. Strings are known to be a little slow, but they are still quite fast. Also note that Feather.jl does lazy loading by default now. Unfortunately we are still limited to 4GB per feather file because the feather standard badly needs to be upgraded, but stay tuned for that.

What about Parquet? It seems more promising to me, as it allows compression of the files.

If my memory serves me correctly, I don’t think this is true anymore.

It depends on what you are doing. Parquet is definitely a more capable format than Feather, but it is also quite a bit more complicated. I think the intended use case for Feather was just “well, I have a few GB of data that I’m messing around with, I need to store it while I’m working on it for the next few weeks” whereas Parquet was intended as a data source for “production” processes.

To answer your question: yes, you should be able to use the arrays provided by Arrow.jl to wrap arrays that appear in Parquet. I have yet to take a serious look at the Parquet metadata, so I don’t really know what would be involved in doing that. I’m pretty confident that it would be significantly more complicated than Feather.

If there was any change to the standard, it doesn’t seem to be reflected here. Like I said, I have a number of thoughts on how we should go about this sort of thing, so I may have some news in the future.