[ANN] New CSV.jl 0.5 Release

#1

I’m pleased to announce a new release of the CSV.jl package.

This release brings improvements in several areas, including performance, new features, and greater flexibility.

Notable improvements:

  • “Perfect” column typing: gone are the days of rows_for_type_detect and parsing getting messed up after 10K rows. CSV.jl now gets column types right every time, and without needing to restart parsing.
  • Auto delimiter detection: don’t worry about keeping track of which file has which delimiter; CSV.jl will figure it out for you!
  • Better automatic handling of invalid files: invalid values? wrong number of values on a row? CSV.jl will handle such files gracefully, printing helpful messages about anything unexpected it runs into.
  • Improved performance: great care has been taken to improve performance on several levels: faster underlying type parsers (provided by Parsers.jl), better data locality and cache friendliness, and greater use of custom Julia structures for efficiency.
  • Enhanced APIs for the CSV.File type: in addition to allowing iteration over rows directly, it now provides getproperty to access efficient read-only columns of the underlying data. If mutable columns are needed, you can copy(col) or use CSV.read(file; copycols=true).
  • And lots and lots of examples in documentation!
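
As a quick taste of the `CSV.File` API described above, a minimal sketch (the file name and column names here are made up for illustration):

```julia
using CSV, DataFrames

# hypothetical file "data.csv" with columns `a` and `b`
f = CSV.File("data.csv")

# iterate rows directly
for row in f
    @show row.a, row.b
end

# getproperty returns an efficient read-only column
col = f.b
mutable_col = copy(col)                    # copy if mutation is needed

# or materialize mutable columns up front
df = CSV.read("data.csv"; copycols=true)
```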

These improvements come in addition to many smaller bugfixes and quality-of-life enhancements. Great effort has been taken to ensure CSV.jl provides a rich set of features, comparable to or better than other world-class CSV parsers. (See the feature comparison table below!)

As always, please open issues as you run into bugs or performance issues and we’ll try to address things as quickly as possible. Cheers!

CSV.jl R fread Pandas
Char delimiters :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
String delimiters :heavy_check_mark: :heavy_check_mark:
Regex delimiters :heavy_check_mark:
Fixed-width files :heavy_check_mark: :heavy_check_mark:
Quoted fields :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Custom open/close quote characters :heavy_check_mark:
Skip/offset rows to parse :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Limit rows to parse :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Manually provide column names :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Multiple rows as column names :heavy_check_mark: :heavy_check_mark:
Perfect type inference w/o restarting :heavy_check_mark:
Manually specify column types :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Specify arbitrary missing strings :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
“Normalize” column names :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Skip rows at end of file :heavy_check_mark: :heavy_check_mark: (python engine only)
Ignore commented rows :heavy_check_mark: :heavy_check_mark:
Handle rows w/ too few/many columns :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Read a file transposed :heavy_check_mark:
Custom decimal separator for floats :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Custom Bool string values :heavy_check_mark: :heavy_check_mark:
“pool” string column values :heavy_check_mark: :heavy_check_mark:
Control over invalid values :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Select/drop specific columns :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Apply transform functions :heavy_check_mark: :heavy_check_mark:
Iteration over rows :heavy_check_mark: :heavy_check_mark:
Able to parse Date/DateTime values :heavy_check_mark: :heavy_check_mark:
Support reading any IO object :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Progress meter while parsing :heavy_check_mark:
Non-UTF-8 encoded files :heavy_check_mark: :heavy_check_mark:
Multi-threaded parsing :heavy_check_mark:
50 Likes
#2

Great set of examples in the docs!

#3

It would be very useful if you (or someone who participated in this optimization) could write up a short blog post or a summary that links the relevant PRs and discusses what led to practical improvements. Learning from an actual use case that was optimized by experts is always instructive.

14 Likes
#4

Great and thank you - this is way faster!

Immediately noticed the following parse behaviour change:

428.E+03 is parsed as String by default now; it used to be parsed as Float64 with the previous version of CSV.jl.

If I try to force it with types=Dict(Symbol(" S-Mises")=>Float64) for the column in question, I get warnings/errors like

warnings: error parsing Float64 on row = 193959, col = 13: " 428.E+03,", error=INVALID: OK | SENTINEL | DELIMITED | INVALID_DELIMITER

at all lines that are missing a digit between the . and the E in scientific notation. If I hand-edit the source file column value in question to e.g. 428.0E+03, the warning/error goes away.

Is this wanted/expected?

Many thanks,
GC

#5

The key trade-off, if you want to work with DataFrames.jl later, is whether you materialize a DataFrame using DataFrame or DataFrame!, or equivalently whether you use copycols=true vs copycols=false in CSV.read. (I am not trying to give a comprehensive answer to what @Tamas_Papp asked for, as I did not develop the change; this is just my major consideration when using the current design from a DataFrames.jl perspective. Maybe @quinnj can comment more.)

I will use the following DataFrame with 10^7 rows as an example (assuming it is written to disk as testfile.txt).

using Random, DataFrames

Random.seed!(1234)
df = DataFrame(rand(10^7, 10))
df.g = rand(["a", "b", "c"], 10^7)

The difference in loading time between DataFrame and DataFrame! is roughly equal to:

julia> df = DataFrame!(CSV.File("testfile.txt"));

julia> @btime df2 = DataFrame($df);
  1.302 s (74 allocations: 801.09 MiB)

Now when you do some simple by operation you get:

julia> @btime by($df, :g, c=:x5=>sum);
  278.866 ms (155 allocations: 190.74 MiB)

julia> @btime by($df2, :g, c=:x5=>sum);
  120.093 ms (155 allocations: 190.75 MiB)

The short story (from my perspective) is that:

  • if you read in the data and do only a few operations on it use DataFrame! (i.e. copycols=false)
  • if you read in the data and do a lot of different operations on it, using DataFrame (i.e. copycols=true) will probably be faster.

Of course another consideration is that DataFrame! returns a read-only DataFrame, so if you want to mutate vectors you have read in you have to use DataFrame constructor.
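
To make the read-only behavior concrete, a minimal sketch (the column name x1 is assumed from the randomly generated example above):

```julia
using CSV, DataFrames

df_ro = DataFrame!(CSV.File("testfile.txt"))  # columns alias CSV.jl's internal storage
# df_ro.x1[1] = 0.0                           # would error: columns are read-only

df_rw = DataFrame(CSV.File("testfile.txt"))   # copies columns on construction
df_rw.x1[1] = 0.0                             # fine: columns are ordinary mutable vectors
```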

4 Likes
#6

Thanks for reporting! Would you mind opening an issue on the CSV.jl or Parsers.jl repo? This must have regressed with all the new work that’s gone in, should be a simple fix.

#7

Issue filed in CSV.jl repo. Cheers,
GC

2 Likes
#8

Question: would it be possible to get this to infer?

file = CSV.File("file.csv")
test(it) = first(it).first_column
@code_warntype test(file)
#9

If the column type is determined from the file contents, no.

Use a function barrier.
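
For reference, a function barrier here means passing the column to a separate function: the caller pays one dynamic dispatch, after which the function body is compiled for the concrete column type. A sketch, reusing the hypothetical file and column name from the question above:

```julia
using CSV

# The column's eltype is only known at runtime, so code touching it
# directly in the calling scope cannot infer. Behind a function barrier,
# `col` has a concrete type and the loop is type-stable.
function total(col)
    s = zero(eltype(col))
    for x in col
        s += x
    end
    return s
end

file = CSV.File("file.csv")
total(file.first_column)   # one dynamic dispatch, then fast typed code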

2 Likes
#10

But isn’t the type of each column already stored in file?

#11

CSV.jl defines CSV.getcell(f::CSV.File, T, col, row) which would be inferrable for individual values. It also doesn’t require iteration. You can get the types for a file by doing CSV.gettypes(f).
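
That is, something along these lines (a sketch using the signatures mentioned above, with a hypothetical file):

```julia
using CSV

f = CSV.File("file.csv")
Ts = CSV.gettypes(f)             # vector of column types, known at runtime
x = CSV.getcell(f, Ts[1], 1, 1)  # inferrable once T is supplied: column 1, row 1
```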

1 Like
#13

@quinnj Great update so far, really enjoying using it! Just one small thing I ran into: when my file ends with a line containing only a comment, a line of missings gets added at the end.
MWE:

shell> cat test.csv
a,b,c
1,2,3
4,5,6
# Comment

julia> using CSV

julia> CSV.read("test.csv", comment="#")
3×3 DataFrames.DataFrame
│ Row │ a       │ b       │ c       │
│     │ Int64⍰  │ Int64⍰  │ Int64⍰  │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1       │ 2       │ 3       │
│ 2   │ 4       │ 5       │ 6       │
│ 3   │ missing │ missing │ missing │

Is this an issue or an error on my part?

#14

Ah, that sounds like a bug! And I think I know what the fix should be for it. Hold tight.

#15

Ok, fix is up in a PR here: https://github.com/JuliaData/CSV.jl/pull/440. Once tests pass, I’ll merge and tag a new patch release.

4 Likes
#16

Whoa, that went quick! Thank you, keep up the great work!