[ANN] TableReader.jl - A fast and simple CSV parser

Tags: package, announcement, data, csv
#1

I’m excited to announce that my new package, TableReader.jl, has been registered as an official package. You can now install it by typing add TableReader in the package management mode of the REPL.

As the title says, this is a new CSV parser written for Julians. The expected response is “Why yet another? We already have several packages to read CSV files!”. Yes, you are absolutely right. We already have CSV.jl, CSVFiles.jl, CSVReader.jl, TextParse.jl, and more. But please give me a moment to explain the motivation behind the package.

The features of TableReader.jl include:

  • Quick startup.
  • High-performance data loading.
  • Robustness to very wide tables.
  • Support for various data sources.

Let’s see these features one by one from the top.

Quick start-up time

Here is a comparison of start-up delays. The data file contains only two rows with three columns, so it is suitable for measuring the initial JIT compilation cost. As you can see, TableReader.jl starts faster and allocates less memory than the other three packages.

julia> using TableReader

julia> @time readcsv("test/test.csv");
  2.231962 seconds (2.82 M allocations: 141.191 MiB, 1.53% gc time)

# Restart session
julia> using CSV, DataFrames

julia> @time DataFrame(CSV.File("test/test.csv"));
  7.829930 seconds (33.46 M allocations: 1.398 GiB, 9.08% gc time)

# Restart session
julia> using CSVFiles, DataFrames

julia> @time DataFrame(load("test/test.csv"));
 13.138267 seconds (49.00 M allocations: 2.274 GiB, 9.80% gc time)

# Restart session
julia> using TextParse, DataFrames

julia> @time begin
           data, names = csvread("test/test.csv")
           DataFrame(collect(data), Symbol.(names))
       end;
  5.404968 seconds (15.94 M allocations: 769.788 MiB, 5.84% gc time)

Fast data loading

TableReader.jl is carefully optimized for speed. Here is a benchmark using six real-world data sets with various characteristics (long vs. wide, homogeneous vs. heterogeneous types, etc.). I used the latest released versions of the packages available today. See here for the benchmark scripts and data sets.

The following plot shows the results (each dataset was run six times, excluding the first call). TextParse.jl was not measured separately because it is used internally by CSVFiles.jl. Apparently, TableReader.jl is the fastest parser in this benchmark set. Missing values indicate a parsing error or an excessively long run time. So, please note that this is by no means a fair and complete benchmark, but I believe it represents the performance you will typically see with your own data files.

If you are interested in the magic under the hood, take a quick glance at tokenizer.jl, in which I implemented a scanner (or tokenizer) as a hand-crafted finite state machine that annotates the raw text data. In addition, caching strings reduces the allocation cost of string values, which are often repeated consecutively within a column.
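The string-caching idea can be sketched roughly as follows. This is an illustration of the technique under my own assumptions, not TableReader.jl’s actual code; cached_string! is a hypothetical helper name:

```julia
# Hedged sketch of string caching: runs of identical values within a column
# reuse one allocated String instead of allocating a fresh one per cell.
# (Illustrative only; TableReader.jl's implementation may differ.)
function cached_string!(cache::Dict{Vector{UInt8},String}, bytes::Vector{UInt8})
    get!(cache, bytes) do
        String(copy(bytes))  # materialize a String only on a cache miss
    end
end

cache = Dict{Vector{UInt8},String}()
a = cached_string!(cache, UInt8['f', 'o', 'o'])
b = cached_string!(cache, UInt8['f', 'o', 'o'])  # cache hit: reuses a
```

A real tokenizer would additionally bound the cache size and copy the key buffer, but the core win is the same: repeated cells cost a dictionary lookup instead of an allocation.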

Robustness to wide tables

The main problem that drove me to create a new CSV parser is that the other packages are not good at handling very wide data files. For example, CSV.jl failed to parse a file with ~20,000 columns due to a limit of type inference on Julia 1.1 (https://github.com/JuliaData/CSV.jl/issues/388). This failure has been fixed on the development branch of Julia; however, the first call still takes a very long time (in my experiment, the 1st call took 186 seconds and the 2nd took 5 seconds). I often use this kind of wide file in my research, and such a long wait is not acceptable to me.

In contrast, TableReader.jl is very robust to wide files. The wide file mentioned above can be loaded within a few seconds with TableReader.jl. This is because TableReader.jl does not generate a schema-specific parser, and thus it requires much less time to start parsing.

Various data sources

TableReader.jl supports not only local files but also remote files. If the filename passed to readcsv looks like a URL, the function automatically delegates the download to the curl command and streams the data to the parser.

readcsv("path/to/localfile.csv")
readcsv("https://example.com/path/to/remotefile.csv")
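The delegation to curl could look roughly like this. This is a hypothetical sketch with made-up helper names (looks_like_url, open_source), not TableReader.jl’s actual implementation:

```julia
# Hedged sketch: if the filename looks like a URL, spawn curl and stream its
# stdout to the parser through a pipe; otherwise open the local file.
# Helper names are hypothetical, not TableReader.jl's API.
looks_like_url(s::AbstractString) = startswith(s, "http://") || startswith(s, "https://")

function open_source(filename::AbstractString)
    if looks_like_url(filename)
        return open(`curl -sS $filename`)  # stream the download via a pipe
    else
        return open(filename)
    end
end
```

Streaming through a pipe means the parser can start consuming data before the download finishes, without writing a temporary file.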

readcsv automatically detects the compression format if the file is compressed and transparently decompresses data as it reads. Currently, gzip, zstd, and xz are supported:

readcsv("somefile.csv.gz")
readcsv("somefile.csv.zst")
readcsv("somefile.csv.xz")
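Transparent decompression of this kind typically works by sniffing magic bytes at the start of the stream. Here is a minimal sketch of that idea; it is an assumption about the mechanism, and TableReader.jl’s actual detection logic may differ:

```julia
# Hedged sketch of compression-format detection via magic bytes.
function detect_compression(bytes::Vector{UInt8})
    if length(bytes) >= 2 && bytes[1:2] == UInt8[0x1f, 0x8b]
        return :gzip   # gzip magic: 1f 8b
    elseif length(bytes) >= 4 && bytes[1:4] == UInt8[0x28, 0xb5, 0x2f, 0xfd]
        return :zstd   # zstd magic: 28 b5 2f fd
    elseif length(bytes) >= 6 && bytes[1:6] == UInt8[0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00]
        return :xz     # xz magic: fd 37 7a 58 5a 00
    else
        return :plain  # no known magic: assume uncompressed text
    end
end
```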

TableReader.jl can also read data from the output of a running process. If your data files are archived in a tarball or a zip file, this feature is very useful because you can read a file from the archive in the following way:

readcsv(`tar -Oxf archive.tar.gz somefile.csv`)
readcsv(`unzip -p archive.zip somefile.csv`)

You can also read data from any I/O object that supports a minimal set of I/O methods. So, for example, if your file contains some metadata in its header, you can read the metadata first and then resume CSV parsing like this:

file = open("somefile.csv")
metadata = readline(file)  # consume the metadata line first
dataframe = readcsv(file)  # parsing resumes from the current position

Limitations

I have not yet implemented many features that other packages have. For example, you cannot specify a parser or a type for each column. Also, backslash-escaped strings are not supported yet. I’d like to support these features if you can convince me that they are needed in real use cases.

Also, the length of a field is limited to 16 MB due to the internal encoding of a token. I’ve never seen a field long enough to hit this limit, but it may happen if you try to store very long texts in a CSV file. The limit could be relaxed at the cost of more memory, but at the moment I have no plan to do so because I think the current encoding works well in 99.9% of cases.
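To make such a limit concrete, here is a hypothetical packed-token layout that would produce it (TableReader.jl’s actual encoding may differ): if a token’s kind and length share one 64-bit word and the length gets 24 bits, a field can hold at most 2^24 - 1 bytes, i.e. just under 16 MiB.

```julia
# Hypothetical packed-token encoding (not TableReader.jl's actual layout):
# the low 24 bits of a UInt64 store the field length, capping it near 16 MiB.
const LENGTH_BITS = 24
const MAX_FIELD_LENGTH = (1 << LENGTH_BITS) - 1  # 16_777_215 bytes

pack_token(kind::UInt8, len::Int) = (UInt64(kind) << LENGTH_BITS) | UInt64(len)
token_kind(t::UInt64) = UInt8(t >> LENGTH_BITS)
token_length(t::UInt64) = Int(t & MAX_FIELD_LENGTH)

t = pack_token(0x03, 1_000)  # a token of kind 3 spanning 1000 bytes
```

Packing both fields into one machine word keeps tokens cheap to store and pass around, which is exactly why widening the length field would cost memory.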

How to contribute

The easiest way to contribute to this package is to install and use it. And if you find something you like or dislike, give me feedback here! Any comments, issue reports, and pull requests are welcome. Since I’m not a native speaker of English, I really appreciate corrections and improvements to the docs. Also, please don’t forget to click the star button on GitHub to motivate me to continue developing the package :wink:

57 Likes

#2

That code sure is magic: 400 out of 1000 lines contain macros! :slight_smile:

How does the speed compare to R’s fread, which seems to be very fast? See e.g. Reading Data Is Still Too Slow.

0 Likes

#3

To be fair, @goto and @label could just as well have been keywords.

2 Likes

#4

Thank you so much for this. Reading gzipped CSV files has been a long-standing issue for me (CSV.jl#366), and so have wide tables (e.g. Stan output with a lot of parameters). You have solved both! :clap:

11 Likes

#5

Well done, this is pretty amazing!

Should we try to incorporate your benchmarks into https://github.com/davidanthoff/csv-comparison, so that they run routinely with the other benchmarks at https://www.queryverse.org/benchmarks/?

10 Likes

#6

This is awesome.

0 Likes

#7

Thanks for this. I just tested it for my case, with a file that is 573328×35, and it’s basically twice as fast as CSV.jl. The @btime macro showed 1.398 s for TableReader.jl reading the file and 2.625 s for CSV.jl. Pretty cool.

0 Likes

#8

@alejandromerchan did you also try CSVFiles.jl? I think the situation right now is that TableReader.jl and TextParse.jl (which powers CSVFiles.jl) are each faster for different kinds of files.

0 Likes

#9

I have now played around a bit with the benchmark that @bicycle1885 showed above. A fair bit of work had been sitting on TextParse.jl’s master branch for a while; I have now tagged it in a release, and it makes a difference. I think it is also interesting to test TextParse.jl directly rather than CSVFiles.jl: there are some inefficiencies introduced by the latter. I hope to fix them soon, but if we just want to compare raw parsing performance, they are probably more of a distraction.

So, with these caveats, here is what I get (sorry for the mislabeled output; it should say “TextParse.jl” instead of “CSVFiles.jl”). The story is quite mixed: TextParse.jl is faster for the diamonds, parking, tmp0u3qt3mu, and winemag datasets, whereas TableReader.jl is ahead for the flights14 dataset. I didn’t even try the dataset with the many columns, because I know that TextParse.jl currently has no chance on that one :slight_smile:

At some level it seems to me that TableReader.jl should probably be classified as “couldn’t read” for the tmp0u3qt3mu dataset, right? I think right now it can only be read after some preprocessing of the file? For the results I linked to, I replaced all the tabs in the file with a space, which seems to be what you did as well?

This of course does not take away one bit from the fact that TableReader.jl is currently without peer in terms of first-use times and many-column files!

My broad takeaway from all of this is probably as follows:

  • I think TextParse.jl still has the fastest core parsing algorithms, maybe with the exception of string handling when TableReader’s cache kicks in.
  • TableReader.jl clearly gets it right in how things are put together, i.e. the core parsing kernels are exposed to users in an efficient way. For TextParse.jl (and CSVFiles.jl), on the other hand, we currently make quite a number of mistakes in how we assemble our very fast core parsing algorithms: the compiler overhead we introduce can easily nullify the fast core parsing routines. I’m pretty positive we can fix that, but we’ll probably have to wait and see whether my optimism is warranted :slight_smile: I’m also pretty positive that if my plan there works out, it would solve the many-column issue as well.

In any case, having TableReader.jl is fantastic, both as a tool for users and as a benchmark for other packages to sort out where we can do better!

10 Likes

#10

I haven’t. When I started the project a few months ago, CSVFiles.jl didn’t work for some reason I don’t remember, so I did all my scripts with CSV.jl. I can certainly do a quick check.

1 Like

#11

Thank you for your feedback. TableReader.jl seems to be working well on your machine.

0 Likes

#12

Thank you for updating TextParse.jl. I’ve updated TextParse.jl, and some errors I saw in the previous version have disappeared (maybe this issue has been solved?).

I’ve checked your benchmark results. Thank you for doing that. However, TableReader.jl seems to be a little slower than I expected on your machine, while TextParse.jl showed roughly the same speed as on mine. For example, parking-citations.csv took 20 seconds with TableReader.jl and 27 seconds with TextParse.jl on my macOS machine, but your results show 35.7 seconds with TableReader.jl and 24.8 seconds with TextParse.jl. Which operating system did you use for benchmarking? Perhaps the severe slowdown of floating-point number parsing on Windows is not yet resolved (https://github.com/bicycle1885/TableReader.jl/issues/3). I have no way to check the current status because I have no access to Windows machines.

0 Likes

#13

Interesting! Yes, I ran on Windows. I can rerun on Mac tomorrow, let’s see what I get there.

0 Likes

#14

Very cool!

Out of curiosity: any reason why you hand-coded the state machine instead of using Automata.jl?

0 Likes

#15

I guess you mean Automa.jl? The main reason is that I needed to make the delimiter a parameter of the parser. Automa.jl does not support parser parameters that are variable at runtime because it compiles static regular expressions at code-generation time.
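To illustrate the constraint, here is a minimal hand-written scanner that takes the delimiter as a runtime argument, which is the kind of flexibility that statically compiled regular expressions cannot offer. This is a simplified illustration, not TableReader.jl’s actual tokenizer:

```julia
# Minimal hand-coded scanner with a runtime delimiter parameter.
# Splits one record into fields, honoring double-quoted sections.
# (Illustrative only; no escape handling, unlike a real CSV tokenizer.)
function split_record(line::AbstractString, delim::Char)
    fields = String[]
    buf = IOBuffer()
    inquote = false
    for c in line
        if c == '"'
            inquote = !inquote          # toggle quoted state
        elseif c == delim && !inquote
            push!(fields, String(take!(buf)))  # field boundary
        else
            write(buf, c)
        end
    end
    push!(fields, String(take!(buf)))   # flush the final field
    return fields
end
```

Because delim is an ordinary argument, the same compiled function handles commas, tabs, or any other separator chosen at runtime.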

2 Likes

#16

I am trying to install TableReader but getting the following error. Should this work already?

(v1.0) pkg> add TableReader
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package TableReader [70df011a]:
 TableReader [70df011a] log:
 ├─possible versions are: 0.1.0 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions 0.1.0
 └─restricted by julia compatibility requirements to versions: uninstalled — no versions left

It is not clear how to proceed from that error message.

1 Like

#17

I think you are using Julia 1.0, not Julia 1.1. TableReader.jl has not been tested against Julia 1.0, so it is restricted to Julia 1.1 users by the package requirements. If you want to try it now, the quickest way would be to update your Julia to 1.1. However, it should work on Julia 1.0 as well because I believe it does not depend on any new features of Julia 1.1, so I will relax the restriction soon.

EDIT: TableReader.jl 0.1.1 works on Julia 1.0.

2 Likes

#18

Try one of the following:

  1. Pkg.gc()
  2. Pkg.update()
  3. Any combination of the two
0 Likes

#19

I also tried to install it on Julia 1.0.3 and it did not work: I got a similar error. I then moved on to Julia 1.1 and it works fine.

0 Likes

#20

Yeah, I’m sorry for the inconvenience. The situation will change soon once https://github.com/JuliaLang/METADATA.jl/pull/22459 is merged.

0 Likes