What's the difference between CSV.jl and CSVFiles.jl?

Thank you.
I was surprised: CSVFiles.jl is 4.5 times faster on 1.2 GB files (a 164000 × 2500 DataFrame).

I added something like that to CSVFiles.jl recently:

load("foo.csv", colparsers=Dict(:colA=>nothing, :colC=>nothing)) |> DataFrame

Essentially when you assign nothing as the colparser for a given column, it will be skipped entirely.

What I don’t have yet is a nice (positive) column selection API. My goal is to make

load("foo.csv") |> @select(:colA, :colB)

work with this, i.e. even though it would look as if you are selecting columns after they are read, the design of Query.jl and CSVFiles.jl is such that I can get this to never actually read any column other than colA and colB. The goal is to support the full column selection story from the @select command.
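Until a positive selection API exists, one workaround is to derive the `colparsers` dictionary from a list of columns to keep. The following is only a sketch: `skip_parsers` is a hypothetical helper, not part of CSVFiles.jl, and it assumes a single header line with plain, unquoted, comma-separated column names.

```julia
# Hypothetical helper (not part of CSVFiles.jl): build a `colparsers` Dict that
# maps every column NOT listed in `keep` to `nothing`, so it gets skipped.
# Assumes one header line with unquoted names separated by `delim`.
function skip_parsers(path::AbstractString, keep::Vector{Symbol}; delim::Char=',')
    header = Symbol.(strip.(split(readline(path), delim)))
    return Dict(name => nothing for name in header if !(name in keep))
end

# Small demo file so the sketch can be run as-is.
path = tempname() * ".csv"
write(path, "colA,colB,colC\n1,2,3\n4,5,6\n")

# Only colC is marked for skipping.
parsers = skip_parsers(path, [:colA, :colB])

# With CSVFiles.jl and DataFrames.jl loaded, this would then be passed as:
#   load(path, colparsers=parsers) |> DataFrame
```

The header is read twice this way (once by the helper, once by `load`), which should be negligible next to parsing a large file.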

Sorry to interfere, but is it possible to have the same functionality for rows as well? For example, to ignore rows that fail some test at read time?
Maybe something like this:

load("foo.csv", rowparsers=Dict(:colA => x -> x > 0)) |> DataFrame

One obvious application is CSV files with comment lines; another is that loading only a subset of the data may be more memory efficient.

For comments, you can already skip reading them via the commentchar keyword (and I just realised that wasn’t documented, I just updated the CSVFiles.jl README).

My plan for arbitrary row filtering is two-fold: I hope to add a true streaming mode of operation to CSVFiles.jl. At that point you could just write:

load("foo.csv", hypothetical_streaming_flag=true) |>
  @filter(_.colA == 3) |>
  DataFrame

and then that would stream things row-by-row, applying the filter per row before things get materialised into the `DataFrame`.
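To make the row-by-row idea concrete, here is a rough sketch in plain Julia of filtering while reading, so that rejected rows are never stored. It is not how CSVFiles.jl works internally: `filter_rows` is a hypothetical function, and it assumes a header line, unquoted comma-separated fields, and integer values in the filtered column.

```julia
# Hypothetical sketch (plain Julia, not CSVFiles.jl internals): keep only the
# rows for which `pred` is true for the value in column `col`.
# Assumes a header line, unquoted fields, and integer values in `col`.
function filter_rows(pred, path::AbstractString, col::Symbol; delim::Char=',')
    kept = Vector{Vector{String}}()
    open(path) do io
        header = Symbol.(strip.(split(readline(io), delim)))
        idx = findfirst(==(col), header)
        idx === nothing && error("column $col not found")
        for line in eachline(io)              # one row at a time
            isempty(strip(line)) && continue  # ignore blank lines
            fields = String.(split(line, delim))
            # The predicate runs before the row is stored anywhere.
            pred(parse(Int, fields[idx])) && push!(kept, fields)
        end
    end
    return kept
end

# Small demo file so the sketch can be run as-is.
path = tempname() * ".csv"
write(path, "colA,colB\n3,x\n1,y\n3,z\n")

rows = filter_rows(x -> x == 3, path, :colA)
```

Only the rows that pass the predicate are ever held in memory, which is the memory benefit mentioned above.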

But unfortunately, CSVFiles.jl writes 4 times slower =((

So: to read, use CSVFiles.jl; to write, use CSV.jl.

Not sure what is going on there, I’ll take a look. At some point they were pretty similar. Right now CSV.jl is much faster on writing numbers because it uses a better floating point write algorithm (I think), but I think the example at the top shouldn’t be affected by that… Not sure.
