[ANN] New CSV.jl 0.5 Release

I’m pleased to announce a new release of the CSV.jl package.

This release brings improvements in several areas: performance, new features, and greater flexibility.

Notable improvements:

  • “Perfect” column typing: gone are the days of rows_for_type_detect and parsing getting messed up after 10K rows. CSV.jl now gets column types right every time, and without needing to restart parsing.
  • Auto delimiter detection: don’t worry about keeping track of which file has which delimiter; CSV.jl will figure it out for you!
  • Better automatic handling of invalid files: invalid values? Wrong number of values on a row? CSV.jl handles such files gracefully, printing helpful messages about anything unexpected it runs into.
  • Improved performance: great care has been taken to improve performance on several levels; underlying type parsers (provided by Parsers.jl), better data locality and cache friendliness, and greater use of custom Julia structures for efficiency
  • Enhanced APIs for the CSV.File type: in addition to allowing iteration over rows directly, it now provides getproperty to access efficient read-only columns of the underlying data. If mutable columns are needed, you can copy(col) or use CSV.read(file; copycols=true); see the sketch after this list.
  • And lots and lots of examples in documentation!
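
Here's a quick sketch of those new CSV.File access patterns (the file and column names are just illustrative):

using CSV

f = CSV.File("data.csv")        # parses once, with the new column typing

for row in f                    # iterate rows directly
    @show row.a                 # access a field of the current row
end

col = f.a                       # efficient read-only column via getproperty
mcol = copy(f.a)                # copy when a mutable column is needed
df = CSV.read("data.csv"; copycols=true)   # DataFrame with mutable columns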

These improvements come in addition to many smaller bugfixes and quality-of-life enhancements. Great effort has gone into ensuring CSV.jl provides a rich set of features, comparable to or better than other world-class CSV parsers (see the feature comparison table below!).

As always, please open issues as you run into bugs or performance issues and we’ll try to address things as quickly as possible. Cheers!

| | CSV.jl | R fread | Pandas |
| --- | :---: | :---: | :---: |
| Char delimiters | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| String delimiters | :heavy_check_mark: | | :heavy_check_mark: |
| Regex delimiters | | | :heavy_check_mark: |
| Fixed-width files | :heavy_check_mark: | | :heavy_check_mark: |
| Quoted fields | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Custom open/close quote characters | :heavy_check_mark: | | |
| Skip/offset rows to parse | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Limit rows to parse | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Manually provide column names | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Multiple rows as column names | :heavy_check_mark: | | :heavy_check_mark: |
| Perfect type inference w/o restarting | :heavy_check_mark: | | |
| Manually specify column types | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Specify arbitrary missing strings | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| “Normalize” column names | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Skip rows at end of file | :heavy_check_mark: | | :heavy_check_mark: (python engine only) |
| Ignore commented rows | :heavy_check_mark: | | :heavy_check_mark: |
| Handle rows w/ too few/many columns | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Read a file transposed | :heavy_check_mark: | | |
| Custom decimal separator for floats | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Custom Bool string values | :heavy_check_mark: | | :heavy_check_mark: |
| “pool” string column values | :heavy_check_mark: | | :heavy_check_mark: |
| Control over invalid values | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Select/drop specific columns | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Apply transform functions | :heavy_check_mark: | | :heavy_check_mark: |
| Iteration over rows | :heavy_check_mark: | | :heavy_check_mark: |
| Able to parse Date/DateTime values | :heavy_check_mark: | | :heavy_check_mark: |
| Support reading any IO object | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Progress meter while parsing | | :heavy_check_mark: | |
| Non-UTF-8 encoded files | | :heavy_check_mark: | :heavy_check_mark: |
| Multi-threaded parsing | | :heavy_check_mark: | |
64 Likes

Great set of examples in the docs!

It would be very useful if you (or someone who participated in this optimization) could write up a short blog post or summary that links the relevant PRs and discusses what led to the practical improvements. Learning from an actual use case that was optimized by experts is always instructive.

15 Likes

Great and thank you - this is way faster!

Immediately noticed the following parse behaviour change:

428.E+03 is now parsed as String by default; with the previous version of CSV.jl it was parsed as Float64.

If I try to force the column in question with types=Dict(Symbol(" S-Mises")=>Float64), I get warnings/errors like

warnings: error parsing Float64 on row = 193959, col = 13: " 428.E+03,", error=INVALID: OK | SENTINEL | DELIMITED | INVALID_DELIMITER

on all lines that are missing a digit between the . and the E in scientific notation. If I hand-edit the source-file value in question to e.g. 428.0E+03, the warning/error goes away.
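
For reference, a minimal sketch that reproduces what I'm seeing (column name made up):

using CSV

io = IOBuffer("x\n428.E+03\n")
f = CSV.File(io)        # the x column comes back as String, not Float64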

Is this wanted/expected?

Many thanks,
GC

The key trade-off, if you want to work with DataFrames.jl later, is whether you materialize a DataFrame using DataFrame or DataFrame!, or equivalently whether you pass copycols=true or copycols=false to CSV.read. (I am not trying to give a comprehensive answer to what @Tamas_Papp asked, since I did not develop the change, but this is my major consideration when using the current design from a DataFrames.jl perspective; maybe @quinnj can comment more.)

I will use the following DataFrame with 10^7 rows as an example (assuming it is written to disk as testfile.txt).

using Random, DataFrames

Random.seed!(1234)
df = DataFrame(rand(10^7, 10))
df.g = rand(["a", "b", "c"], 10^7)

The difference in loading time between DataFrame and DataFrame! is roughly equal to the time of this copy:

julia> df = DataFrame!(CSV.File("testfile.txt"));

julia> @btime df2 = DataFrame($df);
  1.302 s (74 allocations: 801.09 MiB)

Now when you do some simple by operation you get:

julia> @btime by($df, :g, c=:x5=>sum);
  278.866 ms (155 allocations: 190.74 MiB)

julia> @btime by($df2, :g, c=:x5=>sum);
  120.093 ms (155 allocations: 190.75 MiB)

The short story (from my perspective) is that:

  • if you read in the data and do only a few operations on it, use DataFrame! (i.e. copycols=false)
  • if you read in the data and do a lot of different operations on it, using DataFrame (i.e. copycols=true) will probably be faster.

Of course another consideration is that DataFrame! returns a read-only DataFrame, so if you want to mutate vectors you have read in, you have to use the DataFrame constructor.
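
A sketch of that last point, continuing with testfile.txt from above:

using CSV, DataFrames

df = DataFrame!(CSV.File("testfile.txt"))   # no copy; columns are read-only
# df.x1[1] = 0.0                            # would throw; columns can't be mutated
df2 = DataFrame(df)                         # copies the columns
df2.x1[1] = 0.0                             # fine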

4 Likes

Thanks for reporting! Would you mind opening an issue on the CSV.jl or Parsers.jl repo? This must have regressed with all the new work that’s gone in, should be a simple fix.

Issue filed in CSV.jl repo. Cheers,
GC

2 Likes

Question: would it be possible to get this to infer?

file = CSV.File("file.csv")
test(it) = first(it).first_column
@code_warntype test(file)

If the column type is determined from the file contents, no.

Use a function barrier.
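
For example, a sketch of the pattern (reusing the snippet above; the work inside process is made up):

file = CSV.File("file.csv")

# The element type of file.first_column is only known at runtime, so this
# call is dynamically dispatched; but the body of process compiles
# specialized, inferrable code for the concrete column type.
process(col) = count(!ismissing, col)
process(file.first_column)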

3 Likes

But isn’t the type of each column already in file?

CSV.jl defines CSV.getcell(f::CSV.File, T, col, row) which would be inferrable for individual values. It also doesn’t require iteration. You can get the types for a file by doing CSV.gettypes(f).
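
A sketch of how that might look (file name illustrative, and assuming the first column parsed as Float64):

f = CSV.File("file.csv")

Ts = CSV.gettypes(f)                # one type per column
x = CSV.getcell(f, Float64, 1, 1)   # value at column 1, row 1; inferrable
                                    # because the type is passed explicitly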

1 Like

@quinnj Great update so far, really enjoying using it! Just one small thing I ran into: when my file ends with a line containing only a comment, a row of missings gets appended at the end.
MWE:

shell> cat test.csv
a,b,c
1,2,3
4,5,6
# Comment

julia> using CSV

julia> CSV.read("test.csv", comment="#")
3×3 DataFrames.DataFrame
│ Row │ a       │ b       │ c       │
│     │ Int64⍰  │ Int64⍰  │ Int64⍰  │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1       │ 2       │ 3       │
│ 2   │ 4       │ 5       │ 6       │
│ 3   │ missing │ missing │ missing │

Is this an issue or an error on my part?

Ah, that sounds like a bug! And I think I know what the fix should be for it. Hold tight.

Ok, fix is up in a PR here: https://github.com/JuliaData/CSV.jl/pull/440. Once tests pass, I’ll merge and tag a new patch release.

6 Likes

Whoa, that went quick! Thank you, keep up the great work!

Just an update here for those interested: CSV.jl now has multithreaded parsing support on the current #master branch, based on the new multithreading support in Julia 1.3. I wrote up a quick blogpost with some benchmarking comparisons vs. R’s fread and pandas. I’m happy to report that CSV.jl builds on the previous performance efforts, and the multithreaded case is competitive with fread. It’s a great time for csv reading in Julia! Check out the blogpost here: Everyone’s Favorite Blogpost: CSV Benchmarks – Traitement de Données.
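
If you want to try it, a sketch of the general shape (assuming parsing picks up the threads Julia was started with; exact options may still change on #master):

# start Julia 1.3 with multiple threads, e.g. JULIA_NUM_THREADS=4 julia
using CSV

f = CSV.File("big.csv")   # parsing can now make use of the available threads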

(also, it looks like I can’t edit the original feature comparison table in the post above, so you’ll have to visualize for yourself that CSV.jl now has a check in the “Multi-threaded parsing” box :wink: ).

37 Likes

Are there any plans to create a package similar to data.table? Its syntax and reference semantics are very efficient.

2 Likes

Which parts do you like? I am a big data.table user.

df[, .N, by1] is definitely handy, but I sometimes prefer dplyr for easier-to-read code. DataFramesMeta.jl is the closest thing. Avoid Query.jl if you want fast group-by performance.
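
For comparison, a sketch of the DataFrames.jl spelling of that group-by count (using the by form from earlier in the thread; df and :g are placeholders):

using DataFrames

by(df, :g, N = :g => length)   # data.table's df[, .N, by = g]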

So glad you all got the column typing down!

This is huge and makes the library/ecosystem for data science feel so much stronger. Looking forward to more developments.

3 Likes