I’m excited to announce that my new package, TableReader.jl, has been registered as an official package. You can now install it by typing add TableReader in the package management mode of the REPL.
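For example, in the REPL (the prompt below is what Julia 1.1 shows after pressing ]):

(v1.1) pkg> add TableReader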
As the title says, this is a new CSV parser written for Julians. The expected response is “Why yet another one? We already have several packages to read CSV!” Yes, you are absolutely right: we already have CSV.jl, CSVFiles.jl, CSVReader.jl, TextParse.jl, and more. But please give me a moment to explain the motivation behind this package.
The features of TableReader.jl include:
- Quick startup.
- High-performance data loading.
- Robustness to very wide tables.
- Support for various data sources.
Let’s look at these features one by one.
Quick start-up time
Here is a comparison of start-up delays. The data file contains only two rows with three columns, so it is suitable for measuring the initial JIT compilation cost. As you can see, TableReader.jl starts up faster and allocates less memory than the other three packages.
julia> using TableReader
julia> @time readcsv("test/test.csv");
2.231962 seconds (2.82 M allocations: 141.191 MiB, 1.53% gc time)
# Restart session
julia> using CSV, DataFrames
julia> @time DataFrame(CSV.File("test/test.csv"));
7.829930 seconds (33.46 M allocations: 1.398 GiB, 9.08% gc time)
# Restart session
julia> using CSVFiles, DataFrames
julia> @time DataFrame(load("test/test.csv"));
13.138267 seconds (49.00 M allocations: 2.274 GiB, 9.80% gc time)
# Restart session
julia> using TextParse, DataFrames
julia> @time begin
data, names = csvread("test/test.csv")
DataFrame(collect(data), Symbol.(names))
end;
5.404968 seconds (15.94 M allocations: 769.788 MiB, 5.84% gc time)
Fast data loading
TableReader.jl is carefully optimized for speed. Here is a benchmark using six real-world data sets with various characteristics (long vs. wide, homogeneous vs. heterogeneous column types, etc.). I used the latest release versions of the packages available today. See here for the benchmark scripts and data sets.
The following plot shows the result (each data set was read six times, excluding the first call). TextParse.jl was not measured separately because it is used internally by CSVFiles.jl. Apparently, TableReader.jl is the fastest parser in this benchmark set. The missing values are due to parsing errors or excessively long run times. So, please note that this is by no means a fair and complete benchmark, but I believe it represents the typical performance you will see with your own data files.
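For reference, the per-file measurement boils down to something like the following sketch (dataset.csv is a placeholder for one of the benchmark files):

using TableReader
# run six times and drop the first call, which includes JIT compilation
times = [@elapsed readcsv("dataset.csv") for _ in 1:6][2:end]
println(times)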
If you are interested in the magic under the hood, take a quick look at tokenizer.jl, in which I implemented a scanner (or tokenizer) as a hand-crafted finite state machine that annotates the raw text data. In addition, caching strings reduces allocation pressure, because identical string values often appear consecutively within a column.
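To give a flavor of the finite-state-machine approach, here is a toy scanner that splits one unquoted CSV record into fields. This is only an illustrative sketch written for this post; the actual tokenizer.jl is far more elaborate (quoting, token annotation, and so on):

# toy FSM: split one unquoted CSV record into fields
function scan_record(line::AbstractString)
    fields = String[]
    state = :field_start
    start = firstindex(line)
    for (i, c) in pairs(line)
        if state == :field_start
            if c == ','
                push!(fields, "")        # empty field
            else
                start = i                # a token begins here
                state = :in_field
            end
        else  # state == :in_field
            if c == ','
                push!(fields, line[start:prevind(line, i)])
                state = :field_start
            end
        end
    end
    push!(fields, state == :in_field ? line[start:end] : "")  # flush the last field
    return fields
end

scan_record("1,foo,2.5")  # => ["1", "foo", "2.5"]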
Robustness to wide tables
The major problem that drove me to create a new CSV parser is that other packages are not good at handling very wide data files. For example, CSV.jl failed to parse a file with ~20,000 columns due to a limit of type inference on Julia 1.1 (Loading data with a lot of columns takes a long time at the first call · Issue #388 · JuliaData/CSV.jl · GitHub). This failure has been resolved on the development branch of Julia; however, the first call still takes a very long time (186 seconds for the first call versus 5 seconds for the second in my experiment). I often use this kind of wide file in my research, and waiting that long is not acceptable to me.
In contrast, TableReader.jl is very robust to wide files: the file mentioned above can be loaded within a few seconds. This is because TableReader.jl does not generate a schema-specific parser, and thus it requires much less time before it starts parsing.
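If you want to try this yourself, a toy wide file is easy to generate; the shape below (two data rows × 20,000 columns of random integers) is just an example I made up for illustration:

# write a header plus two rows, 20,000 columns wide
open("wide.csv", "w") do io
    println(io, join(("col$i" for i in 1:20_000), ','))
    for _ in 1:2
        println(io, join((rand(1:100) for _ in 1:20_000), ','))
    end
end

using TableReader
@time readcsv("wide.csv")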
Various data sources
TableReader.jl supports not only local files but also remote files. If the filename passed to readcsv looks like a URL, the function automatically delegates the download to the curl command and streams the data to the parser.
readcsv("path/to/localfile.csv")
readcsv("https://example.com/path/to/remotefile.csv")
readcsv automatically detects the compression format if the file is compressed and transparently decompresses the data as it reads. Currently, gzip, zstd, and xz are supported:
readcsv("somefile.csv.gz")
readcsv("somefile.csv.zst")
readcsv("somefile.csv.xz")
TableReader.jl can also read data from a running process. If your data files are archived in a tarball or a zip file, this feature is very useful because you can read a file directly from the archive in the following way:
readcsv(`tar -Oxf archive.tar.gz somefile.csv`)
readcsv(`unzip -p archive.zip somefile.csv`)
You can also read data from any I/O object that supports a minimal set of I/O methods. So, for example, if your file contains some metadata in its header, you can read the metadata first and then resume CSV parsing like this:
file = open("somefile.csv")
metadata = readline(file)  # consume the metadata line at the top
readcsv(file)              # parse the rest of the stream as CSV
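In-memory data should work the same way; for instance, I would expect an IOBuffer, which implements the standard read methods, to be accepted as well:

buf = IOBuffer("name,value\nfoo,1\nbar,2\n")
readcsv(buf)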
Limitations
Many features that other packages offer are not yet implemented. For example, you cannot specify a parser or a type for each column, and backslash-escaped strings are not supported yet. I’d like to support these features if you can convince me that they are needed in real use cases.
Also, the length of a field is limited to 16 MB (2^24 bytes) due to the internal encoding of a token. I’ve never seen a field long enough to hit this limit, but it may happen if you try to store very long texts in a CSV file. The limit could be relaxed at the cost of more memory but, at the moment, I have no plan to do so because I think the current encoding works well in 99.9% of cases.
How to contribute
The easiest way to contribute to this package is to install and use it, and if you find something you like or dislike, give me feedback here! Any comments, issue reports, and pull requests are welcome. Since I’m not a native speaker of English, I really appreciate corrections and improvements to the docs. Also, please don’t forget to click the star button on GitHub to motivate me to keep developing the package!