Thank you.
I was surprised: CSVFiles.jl is 4.5 times faster for 1.2 GB files (a 164000×2500 dataframe).
I added something like that to CSVFiles.jl recently:
load("foo.csv", colparsers=Dict(:colA=>nothing, :colC=>nothing)) |> DataFrame
Essentially, when you assign nothing as the colparser for a given column, it will be skipped entirely.
What I don’t have yet is a nice (positive) column selection API. My goal is to make

load("foo.csv") |> @select(:colA, :colB)

work, i.e. even though it would look as if you are selecting columns after they are read, the design of Query.jl and CSVFiles.jl is such that I can get this to never actually read any column other than colA and colB. The goal is to support the full column selection story of the @select command.
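For reference, the full pipeline as it works today would look roughly like this (a sketch; for now all columns are still parsed, and the pruning described above is the goal rather than current behaviour):

using CSVFiles, DataFrames, Query

# Today this parses every column and only then drops the unselected ones;
# the planned design would push the selection down into the reader.
df = load("foo.csv") |> @select(:colA, :colB) |> DataFrame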
Sorry to interject, but is it possible to have the same functionality for rows as well? Like ignoring rows that fail some test at read time?
Maybe something like this:
load("foo.csv", rowparsers=Dict(:colA => x -> x > 0)) |> DataFrame
One obvious application is CSV files with comment lines; another is that loading only a subset of the data may be more memory efficient.
For comments, you can already skip reading them via the commentchar keyword (I just realised that wasn’t documented, so I’ve updated the CSVFiles.jl README).
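For example (a sketch, assuming '#' marks the comment lines in the file):

using CSVFiles, DataFrames

# Lines starting with '#' are skipped during parsing.
df = load("foo.csv", commentchar='#') |> DataFrame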
My plan for arbitrary row filtering is two-fold: I hope to add a true streaming mode of operation to CSVFiles.jl. At that point you could just write:
load("foo.csv", hypothetical_streaming_flag=true) |>
@filter(_.colA == 3) |>
DataFrame
and that would stream things row by row, applying the filter to each row before anything gets materialised into the DataFrame.
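Until then, the same result can be had eagerly (a sketch; this materialises every row before the filter runs, so it won’t save memory):

using CSVFiles, DataFrames, Query

# Without streaming, the whole file is parsed first and filtered afterwards.
df = load("foo.csv") |> @filter(_.colA == 3) |> DataFrame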
But unfortunately, CSVFiles.jl writes 4 times slower. =((
So, to read, use CSVFiles.jl; to write, use CSV.jl.
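That mixed workflow would look something like this (a sketch with placeholder file names, assuming the table fits in memory):

using CSVFiles, CSV, DataFrames

# Read with CSVFiles.jl (faster parsing), write with CSV.jl (faster writing).
df = load("big.csv") |> DataFrame
CSV.write("big_out.csv", df)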
Not sure what is going on there; I’ll take a look. At some point they were pretty similar. Right now CSV.jl is much faster at writing numbers because it uses a better floating-point writing algorithm (I think), but the example at the top shouldn’t be affected by that… Not sure.