What's the difference between CSV.jl and CSVFiles.jl?


#1

What’s the difference between CSV.jl and CSVFiles.jl?
When should I use each of them?


#2

They are two independent CSV readers. You should use whichever suits your needs better :slight_smile:

I maintain CSVFiles.jl, so I can probably speak better to that one. Here are some things I like about it:

  • It is part of the larger Queryverse.jl file IO story, which gives you a nice uniform API not just for CSV files, but also ExcelFiles.jl, FeatherFiles.jl, StatFiles.jl (Stata, SPSS, and SAS files), and ParquetFiles.jl. All of that is documented here.
  • It works with any source or sink that implements the TableTraits.jl interface. When I last counted that was something like 21 packages.
  • It uses TextParse.jl under the hood, which is fast and getting faster: there are a bunch of things on master and in branches that are not yet reflected in the benchmarks I posted there.
  • It is a mature package, by Julia standards. It's been around for about 1.5 years. It gets a continuous stream of improvements, but the basic structure has been settled and battle tested since the beginning. It has been a very stable story, and I expect it to stay that way going forward, i.e. I generally try very hard to not break things, and the track record so far has been pretty good, I think :slight_smile:
  • You can read gz-compressed files out of the box. I'm just mentioning this here because, in a hilarious twist, the package has supported that for a long time, but I only found out a few weeks ago :slight_smile: See here.
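
To make the uniform load/save API from the list above concrete, here is a minimal sketch of a round trip through CSVFiles.jl. It assumes CSVFiles.jl and DataFrames.jl are installed; the file names and column names are made up for illustration, and the exact output formatting may differ between package versions.

```julia
using CSVFiles, DataFrames

# Write a tiny CSV file so the example is self-contained.
# The paths and column names here are hypothetical.
dir = mktempdir()
inpath = joinpath(dir, "example.csv")
write(inpath, "name,score\nalice,1\nbob,2\n")

# load returns a table source; piping it into DataFrame materializes it.
df = load(inpath) |> DataFrame

# save takes a file name and a table and writes the table back out as CSV.
outpath = joinpath(dir, "copy.csv")
save(outpath, df)
```

The same load/save pair is what the other Queryverse file packages build on, so switching formats should mostly be a matter of changing the file extension and loading the matching package.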

#3

Thank you.
Does it deal well with missings?
Do you suggest using CSVFiles.jl, TextParse.jl, or Queryverse.jl directly to read and write large CSV files?
And what about large binary files?


#4

Yes, it should deal well with missing values. In fact, the missing handling of the CSVFiles.jl/TextParse.jl combo is especially robust: even if the type detection algorithm of the CSV parser initially doesn’t recognize that a column can contain missing values, it will still all work if it then comes across a missing value somewhere else in the file.

Right now there is a small overhead of using CSVFiles.jl vs raw TextParse.jl, but I think overall it is not too bad, so I would still probably use CSVFiles.jl because it is more convenient. Once the next version of DataFrames.jl is released, that overhead will be gone (you can already get that behavior by using the master branch of DataFrames.jl today).

Queryverse.jl is just a meta-package that loads CSVFiles.jl (https://github.com/queryverse/CSVFiles.jl) and a lot of other packages. If you only want to use CSVFiles.jl, I would just load that. But in any case, you will get the same functionality/code, no matter which of these two packages you load.

I like to use FeatherFiles.jl for a binary table format. The same caveat re DataFrames.jl applies there as well, you’ll get much better performance with DataFrames#master today.


#5

I put together a quick comparison of supported features between CSVFiles.jl & CSV.jl:

| Feature | CSV.jl | CSVFiles.jl |
| --- | --- | --- |
| Char delimiters | :white_check_mark: | :white_check_mark: |
| String delimiters | :white_check_mark: | |
| Space delimiter | :white_check_mark: | :white_check_mark: |
| Ignoring any repeated delimiter (fixed-width files) | :white_check_mark: | |
| Char quote characters | :white_check_mark: | :white_check_mark: |
| Separate open/close quote characters | :white_check_mark: | |
| Escape characters | :white_check_mark: | :white_check_mark: |
| Skip rows to parse | :white_check_mark: | :white_check_mark: |
| Limit rows to parse | :white_check_mark: | |
| Handle files without header row | :white_check_mark: | :white_check_mark: |
| Manually provide column names | :white_check_mark: | :white_check_mark: |
| Specify arbitrary row or range of rows for column headers | :white_check_mark: | |
| Specify # of rows to use for type inference | :white_check_mark: | |
| Automatically sample entire file for type inference | :white_check_mark: | |
| Manually specify column types by index or name | :white_check_mark: | :white_check_mark: |
| Parse all columns of one type as another | :white_check_mark: | |
| Specify arbitrary missing value strings | :white_check_mark: | :white_check_mark: |
| Transform column names into valid Julia identifiers | :white_check_mark: | |
| Specify arbitrary row where "data" begins | :white_check_mark: | |
| Skip parsing # of rows at end of file | :white_check_mark: | |
| Specify comment character/string to ignore commented rows | :white_check_mark: | |
| Read a file transposed (rows as columns, columns as rows) | :white_check_mark: | |
| Support alternative decimal separator characters (3.14 vs. 3,14) | :white_check_mark: | |
| Specify arbitrary values to parse as true or false Bool values | :white_check_mark: | |
| Auto detect and parse columns as CategoricalArray | :white_check_mark: | |
| Option to ignore invalid values (replaced with missing) | :white_check_mark: | |
| Ability to selectively parse specific columns | :white_check_mark: | |
| Ability to apply arbitrary scalar transform functions while parsing | :white_check_mark: | |

CSV.jl has been around for a long time now (started May 2015) compared to CSVFiles.jl (June 2017), and it has picked up a ton of requested features along the way. Please chime in if there's anything missing!
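
As a rough illustration of a few rows from the comparison, the sketch below combines a custom delimiter, an alternative decimal separator, a custom missing value string, and a row limit in CSV.jl. It assumes a reasonably recent CSV.jl release in which `CSV.File` accepts the `delim`, `decimal`, `missingstring`, and `limit` keywords; keyword names have changed between versions, so check the docs for the version you have installed. The data is made up.

```julia
using CSV

# Hypothetical semicolon-delimited data with decimal commas and "NA" as
# the missing value marker.
data = """
id;price
1;3,14
2;NA
3;2,50
"""

f = CSV.File(IOBuffer(data);
    delim = ';',            # Char delimiter
    decimal = ',',          # alternative decimal separator (3,14 instead of 3.14)
    missingstring = "NA",   # arbitrary missing value string
    limit = 2)              # only parse the first two data rows

# Columns are available as properties of the CSV.File object.
prices = f.price
```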


#6

That is super helpful!

Two additional features that come to mind that CSVFiles.jl supports are loading data directly from a URL and native support for gz-compressed files. There are a couple of WIP PRs over at TextParse.jl that will add a few more features from @quinnj's list to CSVFiles.jl. Oh, and I guess one other feature of TextParse.jl is the pretty robust column promotion stuff: if the type detection algorithm classifies a column as, say, integer, but then later in the file you come across a float, it will still just work. Same for missing data: if the type detection algorithm classifies a column as not having missing data, everything still works if missing data appears later on. The promotion is not perfect at this point, i.e. there are some important cases that it doesn't support, but especially the missing value story is nice and works well, IMO.
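
To illustrate the idea of column promotion described above, here is a small sketch in plain Julia. This is not how TextParse.jl implements it; `parse_field` and `parse_column` are hypothetical helpers that only show the general technique of widening a column's element type when a later value does not fit the type guessed so far.

```julia
# Hypothetical helper, not part of TextParse.jl: parse one field, treating
# an empty string as missing and preferring Int over Float64.
function parse_field(s::AbstractString)
    isempty(s) && return missing
    i = tryparse(Int, s)
    i !== nothing && return i
    f = tryparse(Float64, s)
    f !== nothing && return f
    return String(s)
end

# Build a column, widening its element type whenever a new value does not
# fit the element type used so far.
function parse_column(fields)
    col = Union{}[]   # start with the narrowest possible element type
    for s in fields
        v = parse_field(s)
        T = promote_type(eltype(col), typeof(v))
        if T != eltype(col)
            # Copy the values parsed so far into a vector of the wider type.
            col = convert(Vector{T}, col)
        end
        push!(col, v)
    end
    return col
end

# The column starts out as Int, is promoted to Float64 at "2.5", and then
# to Union{Missing, Float64} at the empty field.
col = parse_column(["1", "2", "2.5", "", "4"])
```

A real parser would avoid recopying the column on every promotion and would handle many more types, but the control flow is the same: guess narrow, widen on demand.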