CSV read performance vs Pandas

Are there any strategies that could further improve performance when multithreading comes around?

I think there are several ways to speed up reading. The current bottleneck of CSV reading is not tokenization but parsing. TableReader.jl parses independent columns one by one, but this is obviously parallelizable (vertical parallelization). Alternatively, one could first count the number of lines in a buffered chunk and then split it into smaller mini chunks for parallel tokenization and parsing (horizontal parallelization). I haven't tried either yet, but I think both are feasible once Julia supports more sophisticated multithreading.
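The horizontal idea could be sketched roughly as follows. This is a simplified illustration, not TableReader.jl's actual code: the names are made up, and real CSV splitting would also have to respect quoted fields that contain newlines.

```julia
# Sketch of horizontal parallelization: split a buffered chunk at line
# boundaries, then tokenize/parse each mini chunk on its own thread.
using Base.Threads

function split_at_lines(buf::Vector{UInt8}, nchunks::Int)
    ranges = UnitRange{Int}[]
    start = 1
    approx = cld(length(buf), nchunks)
    while start <= length(buf)
        stop = min(start + approx - 1, length(buf))
        # extend the chunk to the next newline so records stay intact
        while stop < length(buf) && buf[stop] != UInt8('\n')
            stop += 1
        end
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

function parse_chunks(buf::Vector{UInt8})
    ranges = split_at_lines(buf, nthreads())
    results = Vector{Vector{String}}(undef, length(ranges))
    @threads for i in eachindex(ranges)
        # each task tokenizes/parses its mini chunk independently;
        # here "parsing" is just splitting into lines for illustration
        results[i] = String.(split(String(buf[ranges[i]]), '\n'; keepempty=false))
    end
    return reduce(vcat, results)
end
```

Because each mini chunk ends on a record boundary, the per-thread results can simply be concatenated in chunk order at the end.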

Of course, full specialization will result in faster code, but the compilation will slow things down!

Full specialization for a data schema and a set of parser parameters may improve performance. However, I think the effect would be modest because modern CPUs are smart enough to lessen the cost of unnecessary code blocks, thanks to branch prediction and the like.


Have you considered horizontal parallelism with a heuristic for synchronizing rows independently within each chunk, rather than a non-parallel pre-scan? This might sound crazy, but I've had great success with this on binary stream data in the past.

The idea is that a worker task can read a chunk of character data from the middle of the file (independent of all other tasks), use a heuristic to synchronize the character stream with the record (i.e., row) boundaries, and continue tokenizing/parsing from there.

Copying data from chunked outputs into the result arrays can be done in parallel as well, so there’s almost zero synchronization or serial processing involved.

The obvious downside is that you need code to save where the record boundaries were inferred for each chunk, check that they match between chunks, and have a consistent fallback code path for difficult cases. As long as the heuristic works with high probability, this isn't a performance issue on average, but it's potentially a fiddly and annoying code path to maintain. To get the most out of the parallelism you also need any compressed input stream to be seekable, which generally requires extra complexity (for example gzip->dictzip).
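The resynchronization step could look something like this sketch. The function name and field-count heuristic are illustrative only; a quoted field containing a newline would defeat this simple version, which is exactly where the fallback path mentioned above would kick in.

```julia
# A worker dropped into the middle of a CSV buffer scans forward for a
# newline whose following line has the expected number of fields, and
# treats the position after that newline as a likely record boundary.
function sync_to_record(buf::AbstractString, pos::Int, ncols::Int)
    i = findnext('\n', buf, pos)
    while i !== nothing && i < lastindex(buf)
        j = findnext('\n', buf, i + 1)
        line = buf[i+1 : (j === nothing ? lastindex(buf) : j - 1)]
        # heuristic check: the candidate line has the expected field count
        if count(==(','), line) + 1 == ncols
            return i + 1   # likely start of a record
        end
        i = j
    end
    return nothing  # fell off the end; caller falls back to the serial path
end
```

Each worker would record where it resynchronized so that adjacent chunks can verify their inferred boundaries agree before the results are stitched together.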

I should say I don’t know if this will work well for CSV! I bring it up because it’s inherently parallel and has worked surprisingly well for me in the past.

Every time I see this word I think "replace it with a critic trained using reinforcement learning techniques!" Could be an interesting research project.

It's probably fair to say a heuristic is an inference made from incomplete information (and hence with an uncertain result). So yes, you could replace a handcrafted heuristic with a learned model.

On that note you might find the following interesting:
https://arxiv.org/abs/1712.01208


I reran my entire benchmark suite with the latest version of all the CSV packages, you can see the results here.

I should also say that I can replicate that TableReader.jl is currently the fastest package for the parking-citations.csv file. That file in particular must trigger some code path in the packages that is not yet exercised by my benchmark suite, and I'm currently trying to narrow down what exactly that is. I have a suspicion, but it will require a bit more digging to really pin it down, and then I hope to add another test case to the benchmark suite that has that particular pattern in it.

I now also have some results for a "cold" read here. For those results I load the package with `using` in a new Julia process and then time the very first load of the file, so this includes JIT compile time etc.

Here are my thoughts on next steps for TextParse.jl/CSVFiles.jl on the performance side of things:

  • I think there is still a lot of room for simply improving the core parsing routines. They are already very good at this point, but there are many more things one could experiment with.
  • I really like the LRU string cache idea that @bicycle1885 is using in TableReader.jl, and I think that would actually be a very nice fit for the internal TextParse.jl structure, so I’ll probably investigate whether I can add that.
  • Once PARTR is out, I plan to introduce multi-threading. The design I have in mind is essentially what @c42f outlined above, I think that is the right strategy for a CSV parser. One thing I really like about TextParse.jl is that it is written in an entirely functional, side-effect free style. Apart from the final vector for the results there is no global or mutable state to worry about!
  • I am pretty convinced at this point that compiling schema-specific parsing kernels was a great idea to test out, but ultimately not the right strategy for many common scenarios. I have a design in my head that I think is a relatively small change, will hopefully get rid of most of the extra compile time, and should help a lot with the first-use experience. That design will keep the compiled kernels around for some special cases where I think they make a lot of sense. This part is more of a hypothesis at this point; I might be wrong about the whole approach, we'll see :slight_smile:
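The LRU string cache idea mentioned above can be illustrated with a minimal sketch. CSV columns often repeat the same values, so interning strings lets the parser reuse one allocation for every occurrence. This toy version keys on already-built `String`s and evicts in insertion order (FIFO) for brevity; TableReader.jl's actual cache works on raw bytes and is considerably more refined.

```julia
# Minimal string-interning cache with bounded size.
struct StringCache
    maxsize::Int
    data::Dict{String,String}
    order::Vector{String}
end
StringCache(maxsize::Int) = StringCache(maxsize, Dict{String,String}(), String[])

function intern!(cache::StringCache, s::String)
    cached = get(cache.data, s, nothing)
    cached !== nothing && return cached   # hit: reuse the stored object
    if length(cache.order) >= cache.maxsize
        evicted = popfirst!(cache.order)  # FIFO eviction; a true LRU would
        delete!(cache.data, evicted)      # also bump entries on each hit
    end
    cache.data[s] = s
    push!(cache.order, s)
    return s
end
```

For a column like repeated country codes or category labels, the parser would call `intern!` on each parsed cell and store the returned (shared) string in the result vector.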

These benchmarks look very useful for developers, but I am not sure I walk away with a clear picture of the relative performance of all these packages. Maybe it would be useful to benchmark them on a set of real-world CSV files from economics, finance, biology, etc., each with roughly 1,000,000 rows?

Another suggestion is to show the cold-read results by default.

What about Arrow for Julia?

and ASDF.jl

It would be nice to have a streaming CSV-to-whatever converter, where the target format is something fast and properly typed.

Parquet would be nice since it plays well with the big data environment. Feather would be fine as well.

Once the file is converted (just once), there's no more struggling with CSV. Thinking along these lines, ideally all published data sets would be converted to something more user-friendly so we don't have to deal with CSV oddities :wink:


Looking forward to an update for CSV.jl 0.5, which seems to be quite a bit faster and more robust than before.


Will do once I’m done with my current travels in a few days!
