[ANN] CSV.jl 0.7 Release

I’m happy to announce a new 0.7 release of the CSV.jl package. This is the last anticipated release before an official 1.0 release, and can be considered a 1.0 “trial run”. Major changes include:

Deprecations

  • CSV.read(file; kw...) has been deprecated in favor of DataFrame!(CSV.File(file; kw...)); see the migration sketch after this list
  • The categorical keyword argument has been deprecated; pooled columns will be returned as PooledArray by default; users can produce CategoricalArrays themselves by loading CategoricalArrays and calling categorical(col)
  • CSV.File(io::Union{Cmd, IO}) has been deprecated in favor of CSV.File(read(io))
  • The writeheader keyword argument has been deprecated in CSV.write in favor of header=false
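
To make the migration concrete, here is a minimal before/after sketch (the file name, and having DataFrames.jl loaded, are assumptions for illustration):

using CSV, DataFrames

# Before (deprecated in 0.7): parse and build a DataFrame in one call
df = CSV.read("data.csv")

# After: parse with CSV.File, then wrap the columns in a DataFrame without copying
df = DataFrame!(CSV.File("data.csv"))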

New features and improvements

  • CSV.File now produces fully mutable columns instead of the read-only CSV.Column; the column types returned will be either a plain Vector{T} (if no missing values were encountered while parsing), a SentinelVector{T} (if missing values were encountered), or a ChainedVector{T} for multithreaded parsing (each chain is a chunk of the file parsed by a separate thread). No matter the column type, they support all mutating operations and can be treated just like a plain Vector{T}.
  • Custom types can now be passed in the type and types keyword arguments; previously, only Int64, Float64, String, Bool, Date, DateTime, and Time were supported, but you can now pass any type. Default fast parsing is supported for all Integer and AbstractFloat types; other custom types need to support zero(T) and Base.parse(T, str) methods to be parsed correctly (see the sketch after this list)
  • String columns are now fully materialized by default (i.e. Vector{String}); for slightly faster parsing times and to avoid allocating every string, you can pass lazystrings=true and String columns will be returned as a custom LazyStringVector array type. Note that LazyStringVector does not support mutating operations (push!, append!, or even setindex!). It also holds a reference to the input file buffer, which means trying to modify the file can lead to undefined behavior in the LazyStringVector.
  • The limit keyword argument now has experimental support when combined with multithreaded parsing
  • A new tasks keyword argument controls how many threads/tasks are spawned in multithreaded parsing
  • A new CSV.Chunks object for iterating over chunks of large files; it accepts all the same arguments as CSV.File, but uses the tasks keyword argument to split the file into tasks chunks, so iterating a CSV.Chunks produces tasks iterations of CSV.File objects (see the sketch after this list). This functionality is considered experimental, so please file issues if you run into bugs.
  • Performance and memory footprint should both be improved; please file issues if you find significant regressions
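
To illustrate the custom-type support, here is a minimal sketch; the Cents type is purely hypothetical, and it assumes (per the note above) that zero(T) and Base.parse(T, str) are all that is required:

using CSV

struct Cents
    n::Int
end
Base.zero(::Type{Cents}) = Cents(0)
Base.parse(::Type{Cents}, s::AbstractString) = Cents(parse(Int, s))

# parse the "price" column as Cents (an IOBuffer stands in for a real file)
f = CSV.File(IOBuffer("price\n100\n250\n"); types=Dict(:price => Cents))

And a rough sketch of the experimental CSV.Chunks iteration; the file name, tasks value, and per-chunk handling are placeholders:

using CSV, DataFrames

for chunk in CSV.Chunks("big.csv"; tasks=8)
    df = DataFrame!(chunk)   # each iteration yields a CSV.File for one chunk
    # ... process df, then let it be garbage collected before the next chunk ...
end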

Thanks for all the feedback and suggestions over the years; it’s been fun to see CSV.jl become a state-of-the-art package supporting more features and better performance than other libraries in other languages.

-Jacob

46 Likes

I think someone will ask this question at some point, so let me ask it right away…

What is the rationale for this deprecation? Does it mean that new Julia users will have to call DataFrame! as suggested above?

Thank you for sharing the update.

Sure; the fact is CSV.jl doesn’t rely on DataFrames.jl for anything, and the only code in the package that touches it is CSV.read, which is just a “glue” convenience function. If Julia had better support for conditional/optional dependencies, this would be a good use case. As-is, it about doubles the total number of CSV.jl dependencies, so it makes a lot of sense to decouple the two packages.

Another reason is that CSV.File now returns mutable columns by default, which means a lot of workloads/use-cases can probably just use CSV.File + column access directly if they don’t need additional DataFrames.jl functionality.
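
For simple cases, that might look something like this (the file and column names are illustrative; just a sketch, not a full API tour):

using CSV

f = CSV.File("data.csv")
col = f.col1                # access a fully-materialized, mutable column directly
sum(skipmissing(col))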

Hope that helps explain!

17 Likes

Thank you for clarifying it. I thought that DataFrames.jl was very lightweight nowadays with just the core types, and that it was a reasonable dependency for CSV.jl. This tiny glue code will be missed by less experienced users, that is for sure.

Perhaps it is reasonable to redirect advertising efforts to higher-level packages that contain all the necessary glue code for data science. It is too much to ask a new user of the language to install and learn two packages just to read a CSV from disk.

Thank you for the work put into CSV.jl and other related projects, it is impressive and super useful. I am looking forward to the first major release. :100:

2 Likes

Thank you for this great package and your updates!

A bit off-topic, but completely agreed. First-class support for conditional/optional dependencies in Julia would be hugely beneficial. Small pieces of “glue code” to other packages would significantly increase the usability of many packages, especially for newcomers (your case is actually one of the simpler ones).

4 Likes

I was somewhat surprised to hear this since I remember this blog post from julialang.org which wasn’t super optimistic about sentinels for missing values. I’m somewhat less confused after reading your blog post on CSV.jl data structures, but nonetheless: could you say a bit more about this? Are SentinelVectors always used for missingness? What about for column types like integers and booleans where all bit patterns are reasonable values? Does this pattern support float columns with distinct NaN and missing?

(Also, would like to echo other commenters here: CSV.jl is an amazing piece of work, and the community is far richer for it.)

2 Likes

Yeah, I guess I should have done a follow-up blog post on what was decided :stuck_out_tongue: . I definitely went back and forth on the SentinelVector{T} vs. Vector{Union{T, Missing}} debate. As the primary author of the Union{T, Missing} Array optimizations, I naturally wanted to go that route; the main (current) roadblock is that there doesn’t exist an operation like convert(Vector{T}, ::Vector{Union{T, Missing}}). This is crucial in CSV parsing, because we always need to assume we might run into a missing value until we’re finished parsing the entire file. This is the primary advantage of SentinelVector: underneath is a plain Vector{T}, and if there were no missings, we just return parent(A). Easy peasy. In discussions with a few core developers (Jeff and Jameson mainly), we think it’d be possible to support some kind of operation like this, and I actually intend to look into what it would take; but for now, the easiest route was SentinelVector.

All that said, I will mention that the “sentinel vector” approach is already what CSV.jl has been using for the last few releases. The difference in 0.7 is that the structure has been formalized, moved to a dedicated package, and now supports all the normal mutating operations you want when working with arrays.

But SentinelVectors are not always used for missingness; for Bool columns, or if you pass a small integer type (e.g. Int8, Int16, Int32, and their unsigned counterparts), we’ll just use a plain Vector{Union{T, Missing}} and convert as necessary.

So my hope is that SentinelVectors are a small implementation detail for the moment that people can use as regular arrays until we can figure out an appropriate alternative.
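
To make that concrete, here’s a rough sketch of how a SentinelVector behaves, using SentinelArrays.jl directly (a minimal example, not the full API):

using SentinelArrays

v = SentinelVector{Float64}(undef, 3)   # wraps a plain Vector{Float64}
v[1] = 1.0
v[2] = missing                          # stored as a reserved sentinel bit pattern
v[3] = 2.0

eltype(v)    # Union{Missing, Float64}
parent(v)    # the underlying plain Vector{Float64}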

5 Likes

Can you give an MWE for when we encounter SentinelVector{T}?


pkg> st
Status `~/Documents/Projects/Senate Voting/Project.toml`
  [336ed68f] CSV v0.7.0
  [a93c6f00] DataFrames v0.21.3

julia> using DataFrames, CSV

julia> df = DataFrame();

julia> df.a = [rand() < .2 ? missing : rand() for i in 1:1000];

julia> df.b = [rand() < .2 ? missing : rand() for i in 1:1000];

julia> CSV.write("test.csv", df);

julia> df2 = CSV.File("test.csv") |> DataFrame;

julia> eltype.(eachcol(df2))
2-element Array{Union,1}:
 Union{Missing, Float64}
 Union{Missing, Float64}

I thought that DataFrames.jl was very lightweight nowadays with just the core types, and that it was a reasonable dependency for CSV.jl.

This has been mentioned as a possibility, but no concrete action has been taken to split out the core types. DataFrames.jl currently contains all of the select, transform, join, vcat etc. code.

Newcomers can use StatsKit if they don’t want to bother with loading packages separately. using StatsKit; DataFrame!(CSV.File(...)) is quite simple and explicit. DataFrame(CSV.File(...)) is also OK if one doesn’t want to explain what ! means: it just makes a copy and uses plain Vectors to store columns.

1 Like

You need to use DataFrame! to get the special types; DataFrame makes a copy and therefore changes the types.

1 Like

Successfully fooled! You’re just checking eltype, which for SentinelVector is indeed Union{T, Missing}. If you instead check typeof, you’ll see the SentinelVectors.
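
For example, building on the MWE above and using DataFrame! (so the columns aren’t copied), something like this should reveal SentinelArrays.SentinelVector column types rather than plain Array:

julia> df2 = DataFrame!(CSV.File("test.csv"));

julia> typeof.(eachcol(df2))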

3 Likes

You got me!

I have not found a way to break this yet by having some conversion fail or a method error, so that’s good.

I was thinking some sort of conversion like DataFrame that doesn’t perform a copy unless you have a SentinelArray for missings would be a good idea. But considering that it might “just work” thanks to its design, that seems a bit premature.

Hi, thanks for the updates. I was wondering what is the best way to work with, or replace, TranscodingStreams. Previously I did something like the below to open a compressed CSV, but this now gives multiple deprecation warnings:
open(p) |> ZstdDecompressorStream |> CSV.read |> DataFrame

gives:

┌ Warning: `CSV.read(input; kw...)` is deprecated in favor of `DataFrame!(CSV.File(input; kw...))`
└ @ CSV C:\Users\Clinton\.julia\packages\CSV\URGyF\src\CSV.jl:40
┌ Warning: `CSV.File` or `CSV.Rows` with `TranscodingStreams.TranscodingStream{ZstdDecompressor,IOStream}` object is deprecated; pass a filename, `IOBuffer`, or byte buffer directly (via `read(x)`)
└ @ CSV C:\Users\Clinton\.julia\packages\CSV\URGyF\src\utils.jl:227

I understand moving to CSV.File, but I don’t understand the proposed replacement for using ZstdDecompressorStream.

Thanks!

Yeah, I guess it gets a little muddy when you have multiple deprecation warnings going on at the same time. The simplest solution in your case is probably to just do:

open(p) |> ZstdDecompressorStream |> read |> CSV.File |> DataFrame!

This incorporates the deprecation feedback by calling read on your IO input argument (the ZstdDecompressorStream), passing that to CSV.File, then calling DataFrame! without making a copy of the columns.

Alternatively, you could use the array API like:

read(p) |> x->transcode(ZstdDecompressor, x) |> CSV.File |> DataFrame!

You might also replace the initial read(p) with Mmap.mmap(p). It might be worth playing around with all these options to see which one is fastest.

Hope that helps!

3 Likes

Ah great, that all worked, thanks.

Using read at the beginning with the array API was faster than using open by about 20%. Using Mmap.mmap(p) didn’t make a difference in terms of performance.

A big thanks to you and the other contributors for all the great work that went into this release!

Whenever I see significant performance improvements in some Julia package, I am always curious about the implementation details. While I know that all changes are public on Github, the developers’ perspective would be very interesting — if you have the time, please write a blog post on what made this release faster.

7 Likes

Have you read https://github.com/JuliaData/CSV.jl/pull/639? I think that’s probably where the biggest improvements came from.

1 Like

But then the whole uncompressed dataset has to live both in the array returned by read and in the CSV.File? Doesn’t this double the required amount of memory compared to real stream processing?

What’s the typical overhead of materializing a Vector from a ChainedVector compared to parsing the CSV file? Subsequent computations with a ChainedVector are likely to be slower (unless you use a transducer approach).

2 Likes

Thanks for the pointer; this is very useful, but I still think blog-style writeups of experiences from significant refactorings would be interesting to read (in general, not just for this package).

1 Like