Thank you.
I was surprised: CSVFiles.jl is 4.5 times faster for 1.2 GB files (a 164000×2500 dataframe).
I added something like that to CSVFiles.jl recently:
load("foo.csv", colparsers=Dict(:colA=>nothing, :colC=>nothing)) |> DataFrame
Essentially, when you assign nothing as the colparser for a given column, it will be skipped entirely.
What I don’t have yet is a nice (positive) column selection API. My goal is to make

load("foo.csv") |> @select(:colA, :colB)

work, i.e. even though it would look as if you are selecting columns after they are read, the design of Query.jl and CSVFiles.jl is such that I can get this to never actually read any column other than colA and colB. The goal is to support the full column selection story of the @select command.
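For reference, the full pipeline as it works today would look roughly like this (a sketch; for now all columns are still parsed, and the pruning described above is the goal rather than current behaviour):

using CSVFiles, DataFrames, Query

# Today this parses every column and only then drops the unselected ones;
# the planned design would push the selection down into the reader.
df = load("foo.csv") |> @select(:colA, :colB) |> DataFrame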
Sorry to interject, but is it possible to have the same functionality for rows as well? Like ignoring rows that fail some test at read time?
Maybe something like this:
load("foo.csv", rowparsers=Dict(:colA => x -> x > 0)) |> DataFrame
One obvious application is CSV files with comment lines; another is that loading only a subset of the data may be more memory efficient.
For comments, you can already skip reading them via the commentchar keyword (I just realised that wasn’t documented, so I’ve updated the CSVFiles.jl README).
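For example (a sketch, assuming '#' marks the comment lines in the file):

using CSVFiles, DataFrames

# Lines starting with '#' are skipped during parsing.
df = load("foo.csv", commentchar='#') |> DataFrame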
My plan for arbitrary row filtering is two-fold: I hope to add a true streaming mode of operation to CSVFiles.jl. At that point you could just write:
load("foo.csv", hypothetical_streaming_flag=true) |>
@filter(_.colA == 3) |>
DataFrame
and that would stream things row by row, applying the filter to each row before anything gets materialised into the DataFrame.
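Until then, the same result can be had eagerly (a sketch; this materialises every row before the filter runs, so it won’t save memory):

using CSVFiles, DataFrames, Query

# Without streaming, the whole file is parsed first and filtered afterwards.
df = load("foo.csv") |> @filter(_.colA == 3) |> DataFrame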
But unfortunately, CSVFiles.jl writes 4 times slower. =((
So, to read, use CSVFiles.jl; to write, use CSV.jl.
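That mixed workflow would look something like this (a sketch with placeholder file names, assuming the table fits in memory):

using CSVFiles, CSV, DataFrames

# Read with CSVFiles.jl (faster parsing), write with CSV.jl (faster writing).
df = load("big.csv") |> DataFrame
CSV.write("big_out.csv", df)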
Not sure what is going on there; I’ll take a look. At some point they were pretty similar. Right now CSV.jl is much faster at writing numbers because it uses a better floating-point writing algorithm (I think), but the example at the top shouldn’t be affected by that… Not sure.