What's the difference between CSV.jl and CSVFiles.jl?

Juan · November 24, 2018, 7:51pm

What’s the difference between CSV.jl and CSVFiles.jl?
When should I use each of them?

davidanthoff · November 24, 2018, 9:15pm

They are two independent CSV readers. You should use whichever suits your needs better.

I maintain CSVFiles.jl, so I can probably speak better to that one. Here are some things I like about it:

It is part of the larger Queryverse.jl file IO story, which gives you a nice uniform API not just for CSV files, but also for ExcelFiles.jl, FeatherFiles.jl, StatFiles.jl (Stata, SPSS, and SAS files), and ParquetFiles.jl. All of that is documented here.
It works with any source or sink that implements the TableTraits.jl interface. When I last counted, that was something like 21 packages.
It uses TextParse.jl under the hood, which is fast and getting faster: there are a bunch of things on master and in branches that are not yet reflected in the benchmarks I posted there.
It is a mature package, by Julia standards. It's been around for about 1.5 years. It gets a continuous stream of improvements, but the basic structure has been settled and battle tested since the beginning. It has been a very stable story, and I expect it to stay that way going forward, i.e. I generally try very hard not to break things, and the track record so far has been pretty good, I think.
You can read gz compressed files out of the box. I’m just mentioning this here because, in a hilarious twist, the package has supported that for a long time, but I only found out a few weeks ago. See here.
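
As a minimal sketch of the uniform `load`/`save` API described above: the file name and column values are made up for illustration, and this assumes CSVFiles.jl and DataFrames.jl are installed.

```julia
using CSVFiles, DataFrames

# Illustrative data and a throwaway file location (both are assumptions).
path = joinpath(mktempdir(), "example.csv")
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

# Write the table with the FileIO-style `save` that CSVFiles.jl provides.
save(path, df)

# Read it back; `load` returns a table source that any TableTraits.jl
# sink (here a DataFrame) can materialize.
df2 = load(path) |> DataFrame
```

The same `load(...) |> DataFrame` pattern should carry over to the other Queryverse file packages, with only the file extension changing.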

Juan · November 24, 2018, 11:08pm

Thank you.
Does it deal well with missings?
Do you suggest using CSVFiles.jl directly, or TextParse.jl, or Queryverse.jl to read and write large csv files?
And what about large binary files?

davidanthoff · November 24, 2018, 11:40pm

Yes, it should deal well with missing values. In fact, the missing handling of the CSVFiles.jl/TextParse.jl combo is especially robust: even if the type detection algorithm of the CSV parser initially doesn’t recognize that a column can contain missing values, it will still all work if it then comes across a missing value somewhere else in the file.

Right now there is a small overhead of using CSVFiles.jl vs raw TextParse.jl, but I think overall it is not too bad, so I would still probably use CSVFiles.jl because it is more convenient. Once the next version of DataFrames.jl is released, that overhead will be gone (you can already get that behavior by using the master branch of DataFrames.jl today).

Queryverse.jl is just a meta-package that loads CSVFiles.jl (https://github.com/queryverse/CSVFiles.jl) and a lot of other packages. If you only want to use CSVFiles.jl, I would just load that. But in any case, you will get the same functionality/code, no matter which of these two packages you load.

I like to use FeatherFiles.jl for a binary table format. The same caveat re DataFrames.jl applies there as well, you’ll get much better performance with DataFrames#master today.

quinnj · November 30, 2018, 6:19am

I put together a quick comparison of supported features between CSVFiles.jl & CSV.jl:

	CSV.jl	CSVFiles.jl
`Char` delimiters
`String` delimiters
Space delimiter
Ignoring any repeated delimiter (fixed-width files)
`Char` quote characters
Separate open/close quote characters
escape characters
skip rows to parse
limit rows to parse
handle files without header row
manually provide column names
specify arbitrary row or range of rows for column headers
specify # of rows to use for type inference
automatically sample entire file for type inference
manually specify column types by index or name
parse all columns of one type as another
specify arbitrary missing value strings
transform column names into valid julia identifiers
specify arbitrary row where “data” begins
skip parsing # of rows at end of file
specify comment character/string to ignore commented rows
read a file transposed (rows as columns, columns as rows)
support alternative decimal separator characters (`3.14` vs. `3,14`)
specify arbitrary values to parse as `true` or `false` `Bool` values
auto detect and parse columns as `CategoricalArray`
option to ignore invalid values (replaced with `missing`)
ability to selectively parse specific columns
ability to apply arbitrary scalar transform functions while parsing

CSV.jl has been around for a long time now (started May 2015) vs. CSVFiles.jl (June 2017), and has developed a ton of requested features. Please chime in if there’s anything missing!

davidanthoff · November 30, 2018, 7:08pm

That is super helpful!

Two additional features that come to mind that CSVFiles.jl supports are loading data directly from a URL and native support for gz compressed files. There are a couple of WIP PRs over at TextParse.jl that will add a few more features from @quinnj’s list to CSVFiles.jl. Oh, and I guess one other feature of TextParse.jl is the pretty robust column promotion stuff: if the type detection algorithm classifies a column as, say, integer, but then later in the file it comes across a float, it will still just work. Same for missing data: if the type detection algorithm classifies a column as not having missing data, everything still works if missing data appears later on. The promotion is not perfect at this point, i.e. there are some important cases that it doesn’t support, but especially the missing value story is nice and works well, IMO.
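
To illustrate the promotion idea in isolation, here is a small sketch in plain Julia. `parse_cell` and `parse_column` are hypothetical helpers written for this example, not TextParse.jl's actual implementation, and treating an empty cell as missing is an assumption.

```julia
# Hypothetical helper: parse one cell as missing, Int, or Float64.
function parse_cell(s::AbstractString)
    isempty(s) && return missing          # assumption: empty cell means missing
    v = tryparse(Int, s)
    v !== nothing && return v
    return parse(Float64, s)              # throws if the cell is not numeric
end

# Hypothetical helper: start with an Int column and widen only when needed.
function parse_column(cells)
    col = Int[]                           # initial guess from "type detection"
    for s in cells
        v = parse_cell(s)
        T = promote_type(eltype(col), typeof(v))
        if T != eltype(col)
            col = convert(Vector{T}, col) # widen the values parsed so far
        end
        push!(col, v)
    end
    return col
end

col = parse_column(["1", "2", "3.5", "", "4"])
```

Here the column starts as `Int`, is widened to `Float64` at `"3.5"`, and then to `Union{Missing, Float64}` at the empty cell, without re-reading the earlier rows.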

FredC · October 25, 2019, 8:10pm

I did some benchmarks and CSVFiles is an order of magnitude more resource intensive than CSV, and twice as slow even on a second run, i.e. after being compiled.

I’m lumping Queryverse in with the Julia Computing “products” as “considered harmful” packages that should be taken out of the registry.

StefanKarpinski · October 25, 2019, 8:55pm

Your notion of what role Julia Computing plays in the Julia ecosystem seems like it may be misguided. This blog post may help clarify:

xiaodai · October 25, 2019, 9:38pm

It is pretty well-established now that CSV.jl is the best CSV reader in Julia.

For Queryverse, I know I wouldn’t use Query.jl because it has poor group_by performance.

oxinabox · October 25, 2019, 9:39pm

No need to be rude.
Just because something doesn’t work for your use-case doesn’t mean it is some kind of antipattern.

Queryverse doesn’t fit my workflow, but I know many others find it useful.
And it is much more than CSVFiles.
You can even use CSV.jl with Queryverse, without issues.

Further, given that CSV.jl is, according to a recent benchmark, one of the all-time fastest CSV parsers, getting to be only 2x slower is solid.

And we need competition to drive innovation.

Don’t be rude

FredC · October 26, 2019, 12:21am

I apologize. While I’m not a fan of what I’ve used from Queryverse, it hasn’t been the kind of time-burning bait-and-switch I’ve experienced with the Julia Computing packages.

I’m absolutely baffled at the way things like JuliaDB are being recommended and marketed despite being severely broken. There’s something pyrrhic going on there and the Julia community needs to be vigilant, in my opinion.

xiaodai · October 26, 2019, 2:47am

Have you tried JDF.jl?

davidanthoff · October 26, 2019, 3:45am

What resources are you referring to? Main memory, or something else? What OS are you on?

mkborregaard · October 27, 2019, 2:58pm

If you’re talking about JuliaDB in particular, then the main developer started a PhD and left the project. The other main contributors also contribute a lot to things like Julia itself, so the development of that package has stalled somewhat. That happens occasionally for all open source projects. FWIW it is my impression that the issues experienced by JuliaDB were due to the package trying to ambitiously push some technological boundaries.

In my view JuliaComputing plays an unambiguously positive role in furthering Julia and supporting the community. If there are particular things you think the community needs to be vigilant about, could you specify them? That makes it a lot easier to engage with.

Juan · October 27, 2019, 7:10pm

What alternative do we have instead of JuliaDB?

oxinabox · October 27, 2019, 8:18pm

What alternative do we have instead of JuliaDB?

So, so many, depending on what you want.
I think there are like over a dozen packages that support the Tables.jl interface.
Though many are specific purpose, like for various file loading,
e.g. Feather.jl, CSV.jl etc.
Some are for databases, like SQLite.jl and LibPQ.jl.
And some are general purpose dataframes,
like DataFrames.jl, TypedTables.jl, Pandas.jl, StructArrays.jl, and IndexedTables.jl (which is a large part of the core of JuliaDB).
And that is not to mention the standard row table, a Vector of NamedTuples, and the standard column table, a NamedTuple of Vectors.

Of course it really depends what you are using JuliaDB or something else for.

Juan · October 27, 2019, 8:37pm

I meant something ready to work alongside packages such as OnlineStats.jl, or able to easily perform statistical on-disk operations, reshape tables, add columns and rows, compute by groups, or do something more complex on data that does not fit in memory.

oxinabox · October 27, 2019, 9:32pm

I suspect you can use CSV.jl in Rows mode for that.
@quinnj ?

Though you really are describing the things that are more or less JuliaDB’s features.
So if you haven’t confirmed that it doesn’t work for you then I would definitely be trying that first,
and raising issues at least so the problems can be tracked.
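
A minimal sketch of the row-streaming idea, assuming a CSV.jl version that provides `CSV.Rows` and using its default behavior of yielding cell values as strings. The file contents and column names are made up for illustration.

```julia
using CSV

# Illustrative file (an assumption); in practice this would be the large file.
path = joinpath(mktempdir(), "big.csv")
write(path, "a,b\n1,x\n2,y\n3,z\n")

# CSV.Rows iterates the file one row at a time instead of materializing
# whole columns, so only the running total is kept in memory.
total = sum(parse(Int, row.a) for row in CSV.Rows(path))
```

Each value is consumed as soon as its row is produced, which is the property that matters for files larger than memory.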

simonbyrne · October 28, 2019, 5:16pm

9 posts were split to a new topic: Why do you use JuliaDB?

Juan · January 28, 2020, 12:24am

What package lets you read a file using a “select=columnnames” option to select the columns you want?

The examples I’ve seen so far read all columns and make the selection later.

df = select(DataFrame(CSV.File("cool_file.csv")), :a, :b)

This doesn’t seem efficient and won’t allow you to read files larger than memory.

or

f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end

Is there an option on any Julia package to read only the desired columns?
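
For reference, a hedged sketch of one way to do this: more recent CSV.jl releases accept a `select` keyword on `CSV.File`, so this assumes such a version is installed. The file contents here are made up for illustration.

```julia
using CSV, DataFrames

# Illustrative file with three columns (an assumption for this example).
path = joinpath(mktempdir(), "cool_file.csv")
write(path, "a,b,c\n1,2,3\n4,5,6\n")

# Only columns :a and :b are kept; column :c is skipped while reading.
df = CSV.File(path; select = [:a, :b]) |> DataFrame
```

A matching `drop` keyword takes the columns to leave out instead of the ones to keep.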
