[ANN] CSV.jl 0.7 Release

Thanks for the info. It is a pity about the change, though, since users now have to adapt their code again. A little convenience for the user can only help!
This is of course only my limited view as a Data Analyst: In most cases the user will read the CSV file and store it in a DataFrame. So far I have found it very pleasant that Julia behaves like R at this point:

# R:
> Data <- read.csv2("mtcars.csv")
> str(Data)
'data.frame':   32 obs. of  12 variables:
 $ X   : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : int  6 6 4 6 8 6 8 4 4 6 ...
...
# Julia:
dateiname = "mtcars.csv"
Daten = CSV.read(dateiname, header = 1, delim = ';', ...... )
show(Daten, allcols = true)

There are a lot of benefits to not requiring DataFrames in CSV. All you have to do now is call DataFrame after CSV.File, which is not a very big change.

# Julia:
using CSV, DataFrames
dateiname = "mtcars.csv"
Daten = CSV.File(dateiname, header = 1, delim = ';') |> DataFrame

Besides, I find myself calling as_tibble() all the time in R, which is quite similar.

Yes of course, but that is not the point (from my point of view). This small - and possibly helpful from a technical point of view - change makes it necessary to change the code and documentation used. And I’m sure some users will wonder if this was really necessary.

A particular example where the dependency on DataFrames was a problem was when DataFrames 0.21 came out with a few breaking changes. CSV had a new release that updated DataFrames to this version, but since that was not a breaking change in CSV itself, it was tagged as 0.6.2, meaning full compatibility with all previous 0.6 versions. Since people were unknowingly relying on the DataFrames API, though, this broke some of their code. With the dependency on DataFrames now made explicit, it will be much easier to restrict DataFrames to a specific version, and CSV doesn’t have to worry about breakage in DataFrames.
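
To make the versioning argument concrete, here is a minimal sketch of the caret-style compatibility rule that Julia's package manager applies by default to pre-1.0 versions: within 0.x, only the patch number may change. The helper name caret_compatible is made up for illustration and is not a Pkg function.

```julia
# Hypothetical helper (NOT part of Pkg): caret-style compatibility check.
# `required` is the lower bound declared by a project, `installed` a candidate.
function caret_compatible(installed::VersionNumber, required::VersionNumber)
    installed >= required || return false
    if required.major > 0
        # 1.x and later: anything with the same major version is compatible
        return installed.major == required.major
    elseif required.minor > 0
        # 0.x: the minor version acts as the "breaking" number
        return installed.major == 0 && installed.minor == required.minor
    else
        # 0.0.x: every patch release counts as breaking
        return installed.major == 0 && installed.minor == 0 &&
               installed.patch == required.patch
    end
end

ok_patch = caret_compatible(v"0.6.2", v"0.6.0")  # patch bump inside 0.6
ok_minor = caret_compatible(v"0.7.0", v"0.6.0")  # minor bump in 0.x
```

Under this rule a project that declares compatibility with CSV 0.6 accepts 0.6.2 automatically, which is why a changed DataFrames API underneath could catch people off guard. With DataFrames as a direct dependency, the same kind of bound can be placed on DataFrames itself.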

I’d just like to point out that such decisions are usually not made haphazardly, and a lot of thought goes into making as few breaking changes as possible. For packages to evolve, though, breaking changes are sometimes needed to provide a better experience in the end.

Seems like a lot of stuff is broken. Does dropmissing! not work anymore? After reading a CSV with dat = DataFrame!(CSV.File("./Leinhardt.csv")), I get these column types:

4-element Array{Type,1}:
 Int64
 Union{Missing, Float64}
 String
 String

Then I drop my missing values with dropmissing!(dat, disallowmissing=true), where the kwarg suggests that column types will be T instead of Union{T, Missing}. But this doesn’t work anymore. Running eltypes(dat) still returns

4-element Array{Type,1}:
 Int64
 Union{Missing, Float64}
 String
 String

How can I make the second column of type T?
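
Independent of what dropmissing! does, a single column can be narrowed with Base functions alone. A minimal sketch, with made-up values standing in for the real column (it says nothing about the cause of the behaviour reported above):

```julia
# Made-up stand-in for a column of type Union{Missing, Float64}
col = Union{Missing, Float64}[55.7, missing, 400.0]

# skipmissing iterates only the non-missing values; collect gives a Vector{Float64}
clean = collect(skipmissing(col))

# Alternative: filter first, then convert the element type explicitly.
# The convert would throw if any missing value were still present.
narrowed = convert(Vector{Float64}, filter(!ismissing, col))
```

In a DataFrame this only changes one column, so rows with missing values in that column have to be dropped from the other columns as well before assigning it back.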


Edit:

GLM.jl is breaking too, but I'm not sure if it's because of the CSV changes.

## define a linear model
lmod = lm(@formula(infant ~ logincome), dat)
MethodError: no method matching fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,2}, ::Bool)

Maybe it's a problem with the new datatype

> dat.infant
101-element SentinelArrays.SentinelArray{Float64,1,Float64,Missing,Array{Float64,1}}:

That might be due to the SentinelVectors CSV now uses under the hood. You should be able to do DataFrame(CSV.File(...)) (without the !) to get back the old behavior of making a copy.

That does not work either. I’ve tried using CSV.read as well and no dice.

There seems to be a bug in similar(::SentinelArray, T, ...): it returns a SentinelArray with eltype Union{T, Missing}.
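
For comparison, this is what similar does for a plain Array when an element type is requested: the result has exactly that element type. A small Base-only sketch (it does not exercise SentinelArrays itself):

```julia
# A plain vector that allows missing values
v = Union{Missing, Float64}[1.0, missing]

# Ask for an uninitialized vector of the same length with eltype Float64
w = similar(v, Float64, length(v))

requested_eltype = eltype(w)
```

Here requested_eltype is Float64, without a Union with Missing, which is the behaviour code such as dropmissing! would presumably rely on.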

Thanks so much for your work on this. Just FYI, I fall into the category of data analysts who do not use DataFrames, and I think decoupling the two makes perfect sense. If users want to read directly to DataFrames, it makes sense to me that the function that does that should be in DataFrames, not CSV.

Not quite; the whole uncompressed data indeed has to live in memory, but there’s no duplicate made for CSV.File; only the output column arrays are allocated in addition to the file buffer. The ideal solution is to unzip the input to an uncompressed file on disk, then pass the uncompressed filename to CSV.File, which will mmap the input buffer; this lessens overall memory pressure since the OS can swap mmapped pages in as needed while parsing.
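
As a rough illustration of what memory-mapping an uncompressed file means, here is a sketch using only the Mmap standard library. It is not CSV.File's internal code; the file contents are made up and written to a temporary file so the example is self-contained.

```julia
using Mmap

# Write a small made-up, uncompressed CSV file to a temporary path
path, io = mktemp()
write(io, "a;b\n1;2\n3;4\n")
close(io)

# Map the file into memory: the result behaves like a Vector{UInt8},
# but the OS loads pages lazily instead of reading everything up front
bytes = Mmap.mmap(path)

# Scan the mapped bytes, e.g. count the newline characters
nlines = count(==(UInt8('\n')), bytes)
```

The mapped array can be passed to any code that works on a byte vector, which is why parsing from an on-disk file can put less pressure on memory than parsing from a fully decompressed in-memory buffer.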

The overhead is not really performance, just memory; you basically just allocate a single contiguous array, then copyto! all the chains. The overhead of using ChainedVector is surprisingly small; in my tests it was rarely more than a 2x hit on indexing, which would very quickly get swamped by other computational costs in whatever processing you’re doing. I also have some ideas sketched out to make common operations multithreaded over the chains; it makes a lot of sense because you get the ChainedVectors from a multithreaded CSV parsing scenario, which means the data is largish, and they’re naturally split into thread-friendly chains. I think it’ll be a fun/interesting experiment to see what kind of performance boosts we can get over regular Arrays.
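
The two ideas in this paragraph can be sketched with plain vectors: flattening chains into one contiguous array via copyto!, and indexing across chains without flattening. The function names are hypothetical, and this is a toy stand-in rather than the actual ChainedVector implementation.

```julia
# Copy every chain into one contiguous array (costs memory for the copy)
function flatten_chains(chains::Vector{Vector{T}}) where {T}
    flat = Vector{T}(undef, mapreduce(length, +, chains; init = 0))
    offset = 1
    for c in chains
        copyto!(flat, offset, c, 1, length(c))
        offset += length(c)
    end
    return flat
end

# Index across chains without copying: walk the chains until the
# global index falls inside one of them (this is the indexing overhead)
function chained_getindex(chains::Vector{Vector{T}}, i::Integer) where {T}
    for c in chains
        i <= length(c) && return c[i]
        i -= length(c)
    end
    throw(BoundsError(chains, i))
end

chains = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
flat = flatten_chains(chains)
fifth = chained_getindex(chains, 5)
```

A real implementation would likely keep cumulative lengths around so the lookup does not have to walk every chain linearly.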

The other thought here is that as you get up to really large datasets, you have to switch over at some point to processing data in “batches”, so this is hopefully a way to ease the transition up, along with the new CSV.Chunks functionality.
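
The general idea of batch processing can be sketched with Base alone. This does not use CSV.Chunks; it simply reads a made-up semicolon-delimited file a few lines at a time and keeps only a running total, so the whole file never has to sit in memory as parsed columns. The function name batched_total is hypothetical.

```julia
# Sum the second field of every data row, processing `batchsize` lines at a time
function batched_total(path::AbstractString; batchsize::Integer = 2)
    total = 0.0
    nbatches = 0
    rows = Iterators.drop(eachline(path), 1)          # skip the header line
    for batch in Iterators.partition(rows, batchsize) # each batch is a Vector{String}
        total += sum(parse(Float64, split(line, ';')[2]) for line in batch)
        nbatches += 1
    end
    return total, nbatches
end

# Made-up example file
path, io = mktemp()
write(io, "model;mpg\nA;21.0\nB;22.8\nC;18.7\n")
close(io)

total, nbatches = batched_total(path; batchsize = 2)
```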

The other reason for this change is to hopefully encourage people to use CSV.File directly if the full DataFrame functionality isn’t needed. There are a lot of workflows that just need to load data into column arrays, do some quick statistical processing on a few columns, then output the results somewhere. This can easily be achieved by just operating on CSV.File columns directly instead of needing to even use DataFrames at all. A huge advantage of Julia is we don’t have to only have a single, one-size-fits-all solution to these kinds of problems (e.g. “you have to load everything into a DataFrame”).
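
A minimal sketch of that workflow, assuming CSV.jl is installed: columns of a CSV.File are accessed as properties and passed straight to ordinary functions. The three rows are taken from the mtcars excerpt earlier in the thread, and an IOBuffer stands in for a real file.

```julia
using CSV
using Statistics: mean

# Small inline data set standing in for a file on disk
data = """
model;mpg;cyl
Mazda RX4;21.0;6
Mazda RX4 Wag;21.0;6
Datsun 710;22.8;4
"""

f = CSV.File(IOBuffer(data); delim = ';')

# Columns are plain vectors; no DataFrame is involved
avg_mpg = mean(f.mpg)
max_cyl = maximum(f.cyl)
```

If the full DataFrame functionality is needed later, the same object can still be passed to DataFrame.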

Thanks for the report @affans, this has been fixed in a patch release to SentinelArrays.jl. Not sure what the GLM.jl issue is you reported, but feel free to open an issue there and we can look into it (from first glance, it doesn’t seem related to CSV.jl changes).

So for e.g. pointer arrays the extra memory wouldn’t really be a problem since the pointer array is much smaller than the sum of the content of the array?

In what way are the chains more thread friendly than a normal array? You have the separate chains because they need to grow while they are being created in the parser but once they are done, having a bunch of separately allocated chains seems worse than just a big array (that you can of course “chunk” by dividing up the indices)?
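
The alternative raised here, chunking one big array by dividing up its indices, might look like the following sketch. The function name threaded_sum is made up; each task reduces one contiguous index range through a view, so nothing is copied.

```julia
# Sum a plain vector by splitting its index range into contiguous chunks,
# one task per chunk
function threaded_sum(x::Vector{Float64}; nchunks::Integer = Threads.nthreads())
    isempty(x) && return 0.0
    chunksize = max(1, cld(length(x), nchunks))
    tasks = [Threads.@spawn sum(view(x, r))
             for r in Iterators.partition(eachindex(x), chunksize)]
    return sum(fetch, tasks)
end

x = collect(1.0:100.0)
s = threaded_sum(x; nchunks = 4)
```

With separately allocated chains, the chunk boundaries are already given by the parser; with a single array they have to be computed as above, which is cheap.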

Thanks again for linking to that PR. It was a very interesting exercise to work through it in detail.

The lesson I learned from it for general code is that manually “unrolling” over a limited set of types with an if ... elseif ... chain, ideally type-stable within each branch, can be very efficient. Very neat.
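
A small sketch of that pattern, with made-up function names and not taken from the PR: the column type is unknown to the compiler at the call site, but inside each isa branch it is concrete, so the inner call can be dispatched statically.

```julia
# Inner kernel: gets compiled separately for each concrete argument type
_total(col) = sum(col)

# Manual "unrolling" over the few column types we expect
function total(col::AbstractVector)
    if col isa Vector{Int}
        return _total(col)               # col is known to be Vector{Int} here
    elseif col isa Vector{Float64}
        return _total(col)               # ... and Vector{Float64} here
    elseif col isa Vector{Union{Missing, Float64}}
        return _total(skipmissing(col))
    else
        return _total(col)               # fallback: ordinary dynamic dispatch
    end
end

# Columns stored with an abstract element type, as in a parsed table
columns = AbstractVector[[1, 2, 3], [1.5, 2.5], Union{Missing, Float64}[1.0, missing]]
totals = [total(c) for c in columns]
```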

Also, I looked at the code for

and found it very instructive. Specifically, when I experimented with constructs like this I always ran into the problem of choosing sentinels when something like NaN is not available. Just randomizing that choice and changing it on demand is very elegant.
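
The idea of picking a random sentinel and changing it on demand can be sketched as a toy type. All names here are hypothetical, and this is not the SentinelArrays.jl implementation, only an illustration of the technique for Int64 values.

```julia
# Toy vector that encodes `missing` as a randomly chosen sentinel value
mutable struct ToySentinelVector
    data::Vector{Int64}
    sentinel::Int64
end

# Start with every slot missing
function ToySentinelVector(n::Integer)
    s = rand(Int64)
    return ToySentinelVector(fill(s, n), s)
end

getvalue(v::ToySentinelVector, i) = v.data[i] == v.sentinel ? missing : v.data[i]

# Pick a fresh sentinel not used by any stored value and rewrite the missing slots
function rechoose_sentinel!(v::ToySentinelVector)
    old = v.sentinel
    fresh = rand(Int64)
    while fresh == old || fresh in v.data
        fresh = rand(Int64)
    end
    for i in eachindex(v.data)
        v.data[i] == old && (v.data[i] = fresh)
    end
    v.sentinel = fresh
    return v
end

function setvalue!(v::ToySentinelVector, x::Int64, i)
    x == v.sentinel && rechoose_sentinel!(v)  # collision: move the sentinel first
    v.data[i] = x
    return v
end

setvalue!(v::ToySentinelVector, ::Missing, i) = (v.data[i] = v.sentinel; v)

v = ToySentinelVector(3)
setvalue!(v, 5, 1)
collide = v.sentinel
setvalue!(v, collide, 2)   # storing a value equal to the current sentinel
```

Note that rechoose_sentinel! rewrites shared state across the whole array, which is why concurrent writes would need extra care in a design like this.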

You need to be a bit careful when it comes to thread safety for this (since it is a shared “global” value in the array).

Yes, as noted in the docs, mutating operations (i.e. setindex!) are not thread-safe in the case where you might run into sentinel collisions, which is only really a problem for <: Integer eltypes. It’s a pretty annoying corner-case, but I haven’t seen an easy way around it. It’s another reason I’m hopeful we can move away from SentinelArrays.jl eventually.
