What's the difference between CSV.jl and CSVFiles.jl?

When should I use each of them?

3 Likes

They are two independent CSV readers. You should use whichever suits your needs better :slight_smile:

I maintain CSVFiles.jl, so I can probably speak better to that one. Here are some things I like about it:

  • It is part of the larger Queryverse.jl file IO story, which gives you a nice uniform API not just for CSV files but also for ExcelFiles.jl, FeatherFiles.jl, StatFiles.jl (Stata, SPSS, and SAS files), and ParquetFiles.jl. All of that is documented here (see the sketch after this list).
  • It works with any source or sink that implements the TableTraits.jl interface. When I last counted, that was something like 21 packages.
  • It uses TextParse.jl under the hood, which is fast, and getting faster: there are a bunch of things on master and in branches that are not yet reflected in the benchmarks I posted there.
  • It is a mature package by Julia standards: it has been around for about 1.5 years. It gets a continuous stream of improvements, but the basic structure was settled and battle-tested from the beginning. It has been a very stable story, and I expect that to continue, i.e. I generally try very hard not to break things, and the track record so far has been pretty good, I think :slight_smile:
  • You can read gz compressed files out of the box. I’m just mentioning this here, because in a hilarious twist the package has supported that for a long time, but I only found out a few weeks ago :slight_smile: See here.
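For a flavor of that uniform file IO API, here is a minimal sketch (the file names are hypothetical):

```julia
using CSVFiles, DataFrames

# load returns an iterable table; pipe it into any TableTraits.jl-compatible sink
df = load("data.csv") |> DataFrame

# the same save verb writes the table back out
save("data_copy.csv", df)
```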
9 Likes

Thank you.
Does it deal well with missings?
Do you suggest using CSVFiles.jl directly, or TextParse.jl, or Queryverse.jl, to read and write large CSV files?
And what about large binary files?

Yes, it should deal well with missing values. In fact, the missing handling of the CSVFiles.jl/TextParse.jl combo is especially robust: even if the type detection algorithm of the CSV parser initially doesn’t recognize that a column can contain missing values, it will still all work if it then comes across a missing value somewhere else in the file.
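As an illustration (this tiny file is made up), a column that looks like clean integers at the top can still pick up a missing value further down:

```julia
using CSVFiles, DataFrames

# a small file whose `age` column only reveals a missing value in the last row
path = joinpath(mktempdir(), "people.csv")
write(path, "name,age\nalice,34\nbob,27\ncarol,\n")

df = load(path) |> DataFrame
# the column should come back as Union{Missing, Int64} rather than erroring
eltype(df.age)
```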

Right now there is a small overhead of using CSVFiles.jl vs raw TextParse.jl, but I think overall it is not too bad, so I would still probably use CSVFiles.jl because it is more convenient. Once the next version of DataFrames.jl is released, that overhead will be gone (you can already get that behavior by using the master branch of DataFrames.jl today).

Queryverse.jl is just a meta-package that loads CSVFiles.jl (https://github.com/queryverse/CSVFiles.jl) and a lot of other packages. If you only want to use CSVFiles.jl, I would just load that directly. In any case, you get the same functionality/code no matter which of the two packages you load.

I like to use FeatherFiles.jl for a binary table format. The same caveat re DataFrames.jl applies there as well, you’ll get much better performance with DataFrames#master today.
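A minimal round-trip sketch with FeatherFiles.jl (the file name is hypothetical):

```julia
using FeatherFiles, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# same save/load verbs as the CSV case, just a different extension
save("table.feather", df)
df2 = load("table.feather") |> DataFrame
```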

2 Likes

I put together a quick comparison of supported features between CSVFiles.jl & CSV.jl:

| Feature | CSV.jl | CSVFiles.jl |
|---|:---:|:---:|
| Char delimiters | :white_check_mark: | :white_check_mark: |
| String delimiters | :white_check_mark: | |
| Space delimiter | :white_check_mark: | :white_check_mark: |
| Ignoring any repeated delimiter (fixed-width files) | :white_check_mark: | |
| Char quote characters | :white_check_mark: | :white_check_mark: |
| Separate open/close quote characters | :white_check_mark: | |
| Escape characters | :white_check_mark: | :white_check_mark: |
| Skip rows to parse | :white_check_mark: | :white_check_mark: |
| Limit rows to parse | :white_check_mark: | |
| Handle files without header row | :white_check_mark: | :white_check_mark: |
| Manually provide column names | :white_check_mark: | :white_check_mark: |
| Specify arbitrary row or range of rows for column headers | :white_check_mark: | |
| Specify # of rows to use for type inference | | :white_check_mark: |
| Automatically sample entire file for type inference | :white_check_mark: | |
| Manually specify column types by index or name | :white_check_mark: | :white_check_mark: |
| Parse all columns of one type as another | :white_check_mark: | |
| Specify arbitrary missing value strings | :white_check_mark: | :white_check_mark: |
| Transform column names into valid Julia identifiers | :white_check_mark: | |
| Specify arbitrary row where "data" begins | :white_check_mark: | |
| Skip parsing # of rows at end of file | :white_check_mark: | |
| Specify comment character/string to ignore commented rows | :white_check_mark: | |
| Read a file transposed (rows as columns, columns as rows) | :white_check_mark: | |
| Support alternative decimal separator characters (3.14 vs. 3,14) | :white_check_mark: | |
| Specify arbitrary values to parse as true or false Bool values | :white_check_mark: | |
| Auto-detect and parse columns as CategoricalArray | :white_check_mark: | |
| Option to ignore invalid values (replaced with missing) | :white_check_mark: | |
| Ability to selectively parse specific columns | :white_check_mark: | |
| Ability to apply arbitrary scalar transform functions while parsing | :white_check_mark: | |

CSV.jl has been around for a long time now (started May 2015, vs. June 2017 for CSVFiles.jl) and has developed a ton of requested features over the years. Please chime in if there's anything missing!

21 Likes

That is super helpful!

Two additional features that come to mind that CSVFiles.jl supports are loading data directly from a URL and native support for gz-compressed files. There are a couple of WIP PRs over at TextParse.jl that will add a few more features from @quinnj's list to CSVFiles.jl. Oh, and one other feature of TextParse.jl is its pretty robust column promotion: if the type detection algorithm classifies a column as, say, integer, but later in the file a float turns up, it will still just work. The same goes for missing data: if the type detection algorithm decides a column has no missing values, everything still works if missing values appear later on. The promotion is not perfect at this point, i.e. there are some important cases it doesn't support, but the missing value story in particular is nice and works well, IMO.
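To make the first two concrete, a quick sketch (the URL and file names are made up):

```julia
using CSVFiles, DataFrames

# load straight from a URL
df = load("https://example.com/data.csv") |> DataFrame

# gz-compressed files are decompressed transparently
df_gz = load("data.csv.gz") |> DataFrame
```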

3 Likes

I did some benchmarks and CSVFiles is an order of magnitude more resource-intensive than CSV, and twice as slow even on a second run, i.e. after compilation.

I’m lumping Queryverse in with the Julia Computing “products” as “considered harmful” packages that should be taken out of the registry.

Your notion of what role Julia Computing plays in the Julia ecosystem seems like it may be misguided. This blog post may help clarify:

4 Likes

It is pretty well-established now that CSV.jl is the best CSV reader in Julia.

For Queryverse, I know I wouldn’t use Query.jl because it has poor group_by performance.

No need to be rude.
Just because something doesn’t work for your use-case doesn’t mean it is some kind of antipattern.

Queryverse doesn’t fit my workflow, but I know many others find it useful.
And it is much more than CSVFiles.
You can even use CSV.jl with Queryverse, without issues.
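For example, a mixed pipeline like this works (the file and column names are made up):

```julia
using CSV, Query, DataFrames

# CSV.jl does the parsing; Query.jl operators consume it like any other table
df = CSV.File("data.csv") |>
    @filter(_.age > 30) |>
    @map({_.name, _.age}) |>
    DataFrame
```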

Further, given that CSV.jl is, according to recent benchmarks, one of the fastest CSV parsers around, being only 2x slower is solid.

And we need competition to drive innovation.

Don’t be rude

29 Likes

I apologize. While I’m not a fan of what I’ve used from Queryverse, it hasn’t been the kind of time-burning bait-and-switch I’ve experienced with the Julia Computing packages.

I’m absolutely baffled at the way things like JuliaDB are being recommended and marketed despite being severely broken. There’s something pyrrhic going on there and the Julia community needs to be vigilant, in my opinion.

Have you tried JDF.jl?

What resources are you referring to? Main memory, or something else? What OS are you on?

If you’re talking about JuliaDB in particular, then the main developer started a PhD and left the project. The other main contributors also contribute a lot to things like Julia itself, so the development of that package has stalled somewhat. That happens occasionally for all open source projects. FWIW it is my impression that the issues experienced by JuliaDB were due to the package trying to ambitiously push some technological boundaries.

In my view Julia Computing plays an overwhelmingly positive role in furthering Julia and supporting the community. If there are particular things you think the community needs to be vigilant about, could you specify them? That would make it a lot easier to engage with.

11 Likes

What alternative do we have instead of JuliaDB?

What alternative do we have instead of JuliaDB?

So, so many, depending on what you want.
I think there are over a dozen packages that support the Tables.jl interface.
Many are special-purpose, like the various file-loading packages, e.g. Feather.jl, CSV.jl, etc.
Some are for databases, like SQLite.jl and LibPQ.jl.
And some are general-purpose dataframes, like DataFrames.jl, TypedTables.jl, Pandas.jl, StructArrays.jl, and IndexedTables.jl (which is a large part of the core of JuliaDB).
And that is not to mention the standard row table, a Vector of NamedTuples, and the standard column table, a NamedTuple of Vectors.
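Those two built-in forms look like this (toy data):

```julia
using Tables

# row table: a Vector of NamedTuples
rowtable = [(a = 1, b = "x"), (a = 2, b = "y")]

# column table: a NamedTuple of Vectors
coltable = (a = [1, 2], b = ["x", "y"])

# both already satisfy the Tables.jl interface
Tables.istable(rowtable), Tables.istable(coltable)  # (true, true)
```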

Of course it really depends what you are using JuliaDB or something else for.

1 Like

I meant something ready to work alongside packages such as OnlineStats.jl, or that can easily be used to perform statistical operations on disk, reshape tables, add columns and rows, compute by groups, or do something more complex that doesn't fit in memory.

I suspect you can use CSV.jl in Rows mode for that.
@quinnj ?

Though you really are describing things that are more or less JuliaDB's features.
So if you haven't confirmed that it doesn't work for you, I would definitely try that first,
and raise issues so the problems can be tracked.
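For reference, a minimal sketch of that Rows mode (the file and column names are made up):

```julia
using CSV

# CSV.Rows streams the file one row at a time instead of materializing it,
# so it can process files larger than memory
total = 0.0
for row in CSV.Rows("big_file.csv"; reusebuffer = true)
    # in Rows mode, values come back as strings unless column types are given
    total += parse(Float64, row.amount)
end
```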

2 Likes

9 posts were split to a new topic: Why do you use JuliaDB?

What package lets you read a file using a "select=columnnames" option to select the columns you want?

The examples I've seen so far read all columns and make the selection later.

df = CSV.File("cool_file.csv") |> @select(:a, :b) |> DataFrame

This doesn't seem efficient and won't allow you to read files larger than memory.

or

f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end

Is there an option on any Julia package to read only the desired columns?