DataFrames 0.11 released

After a long and complex development period, we are glad to announce that version 0.11.0 of DataFrames has been released. Among other features listed in the release notes, the major change introduced by this version is the move from the NA value (from the DataArrays package) to the new missing value (from the Missings package, and soon in Base).

DataFrames has been completely decoupled from DataArrays: the DataFrame constructor will no longer convert columns to DataArrays, but will keep them as they are. DataFrame columns can therefore be either plain Vector{T} objects (without support for missing values), Vector{Union{T, Missing}} (supporting missing values), DataVector{T} (by creating such vectors manually), or any other AbstractVector object. On Julia 0.7, thanks to improvements in the compiler, Vector{Union{T, Missing}} uses efficient storage similar to that of DataArray, and should generally behave like DataVector{T}. The latter type can still be used for optimal performance, especially on Julia 0.6 (but also on 0.7, since not all Vector{Union{T, Missing}} optimizations have been implemented yet).
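As a rough illustration of the column types described above, here is a minimal sketch using only plain vectors. It assumes Julia 0.7 or later, where missing lives in Base; on Julia 0.6 you would need to load the Missings package first. The variable names are just for illustration.

```julia
# Assumes Julia 0.7+; on Julia 0.6, run `using Missings` first.

plain = [1, 2, 3]                 # Vector{Int}: cannot hold missing
with_missing = [1, missing, 3]    # Vector{Union{Missing, Int}}

eltype(plain)                     # Int
Missing <: eltype(with_missing)   # true: this vector accepts missing

# Widening an existing vector so that it can store missing values.
# convert returns a new vector here, so `plain` is left untouched.
widened = convert(Vector{Union{Int, Missing}}, plain)
widened[2] = missing
```

Either kind of vector can be passed as a column to the DataFrame constructor, which keeps it as it is.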

As part of the separation of features into independent packages, PooledDataArray has been deprecated in favor of either CategoricalArray or PooledArray. Indeed, PooledDataArray suffered from a lack of clarity regarding its goals: it was at the same time a way to efficiently store data with a small number of unique values, a way to represent categorical data, and it always supported missing values. Categorical data should now be stored using CategoricalArray, which supports both nominal and ordinal variables and allows comparing elements using operators such as <. Non-categorical data with a small number of unique values should be stored using the PooledArray type. These two types can either accept missing values or not, depending on the needs.
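To make the ordinal case concrete, here is a small sketch. It assumes the CategoricalArrays package is installed and uses its categorical, levels! and ordered! functions; the data and variable names are made up for the example.

```julia
using CategoricalArrays  # assumes the CategoricalArrays package is installed

# An ordinal variable: the levels have a meaningful order
sizes = categorical(["small", "large", "medium", "small"])
levels!(sizes, ["small", "medium", "large"])  # set the level order explicitly
ordered!(sizes, true)                          # mark the variable as ordinal

# Comparison follows the declared level order, not alphabetical order
is_smaller = sizes[1] < sizes[2]               # "small" < "large"
```

For a nominal variable you would simply leave the array unordered, in which case comparisons with < are not allowed.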

Functions to import/export CSV (readtable and writetable) have been deprecated in favor of CSV.read and CSV.write from the CSV package. This allows sharing code and combining our efforts with all other packages working with data.

Finally, modeling features have been moved to a separate StatsModels package. This difference should only be visible to authors of modeling packages, which should now use that package instead of depending on DataFrames. The objective is to allow modeling packages to support any type of data structure automatically.

The porting process should be relatively straightforward. Deprecation warnings are printed, keeping the current code working in many cases (but unfortunately not all cases). NA should be replaced with missing, NAType with Missing and isna with ismissing everywhere. Functions dispatching on DataArray or AbstractDataArray should use AbstractArray{Union{T, Missing}} or AbstractArray{>:Missing} instead, which will match (among others) DataArray{T}. The skipna=true keyword argument should be replaced with skipmissing, e.g. sum(skipmissing(x)). PooledDataArray should be replaced with either CategoricalArray or PooledArray, which will require some adjustments to the code using such arrays. Code using modeling functions should call using StatsModels first. See the DataFrames manual for a short introduction to missing and CategoricalArray.
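The replacements listed above might look like this in ported code. This is only a sketch, assuming Julia 0.7 or later (on Julia 0.6, add using Missings at the top); count_missing is a hypothetical helper written for this example, not a function from any package.

```julia
# Assumes Julia 0.7+; on Julia 0.6, run `using Missings` first.

x = [1, missing, 3]

has_missing = any(ismissing, x)   # previously: any(isna, x)
total = sum(skipmissing(x))       # previously: the old NA-skipping argument

# Hypothetical helper showing dispatch on arrays that allow missing values.
# The first method matches any array whose element type includes Missing;
# the second is the fallback for arrays that cannot contain missing.
count_missing(v::AbstractArray{>:Missing}) = count(ismissing, v)
count_missing(v::AbstractArray) = 0

n_missing = count_missing(x)           # uses the first method
n_plain = count_missing([1, 2, 3])     # uses the fallback
```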

We hope that this new, more modular framework will allow for a better interaction between all packages in the data ecosystem. It should pave the road for future improvements to DataFrames and related packages. However, updating all packages to the new framework will take time. A list tracking progress is available here. Your help is welcome! Please also report any bugs you may find.

Also note that until all packages on your local installation have been ported to DataFrames 0.11.0, they will keep requiring version 0.10.1, and the package manager will not update DataFrames to version 0.11.0. If removing the problematic dependencies is not an option, you can use a separate Julia package directory to test the new framework: just set the JULIA_PKGDIR environment variable before starting Julia, and run Pkg.add("DataFrames").

Hi! I haven’t read the post yet, but before I do I’d like to say that it may be worth posting a short note in the “Community” section of the forum. For example, I (and maybe other people) am subscribed there for “announcements” but maybe not to Domains/Data. I am thankful I was tagged by Chris on gitter, otherwise I would never have known about this…

Fantastic work by so many people! Can’t wait to try this out and port over my packages.

I’ve pinned the post globally for one week, that should be enough for anybody interested to notice it.

Thank you so much @nalimilan. The work you and the rest of the Data people (sensu lato) are doing means so much for this community!

This looks great!

Does anyone have a recap of what the data space looks like now? It seems like the new DataFrames now should be solid going forward. How does it compare to IndexedTables.jl or Pandas.jl both in performance and features? I’m sure someone has been benchmarking master but I can’t find a good thread on it.

:tada: Christmas did come early this year! :fireworks:

I guess the Julia data frame is similar to the R data frame, but now R has a new data format called tibble, and I really would like to see the Julia version of tibble.

@Yifan_Liu can you comment on the advantages of tibble compared to dataframes?

Thank you @nalimilan for the great work.

@Yifan_Liu tibble is just another form of data.frame. It is actually just a data.frame in the backend, but it defines methods on top of data.frame so that it prints more nicely and has more sensible defaults. Put simply, tibble is just a more sensible interface on top of data.frame. DataFrames.jl isn’t built into Julia the way data.frame is built into R. So DataFrames.jl’s equivalent in the R world is data.frame + some packages on top of it. E.g. if all the functionality in tibble were implemented in DataFrames.jl, then DataFrames.jl would become the equivalent of data.frame + tibble.

So it would be more useful to pinpoint the specific features in an R package that you see as beneficial and that don’t already exist in DataFrames.jl.

Does that mean that DataArrays is starting to be abandoned?

(I’m not sure, because it would need similar changes: from NA/Null to Missing, from PooledDataArray to CategoricalArray, etc.)

Couldn’t DataArray be changed to an alias for AbstractArray{Union{T, Missing}} or something similar, to avoid duplicated effort?

No, DataArrays continues to work as it did before (it was even cleaned up a bit in the new release). Only PooledDataArray is deprecated.

DataArray should not be turned into an alias for AbstractArray{Union{T, Missing}}: if we did that, we would essentially remove DataArray. We hope that at some point Array{Union{T, Missing}} will be as efficient as DataArray{T} so that the latter won’t be needed anymore, but for now it is still useful for performance.

This is fantastic, thank you!

Could anyone in the know tell me what is the status with IterableTables.jl / Query.jl support? (I’m aware of the PR-s but they have been quiet for a while now.)

And a recap as asked by @ChrisRackauckas would indeed be very welcome.

4 posts were split to a new topic: Compatibility of Query and Union{T, Missing}

A post was split to a new topic: Interoperability between R and Julia

@nalimilan Fantastic work, thank you very much! It’s fundamental and important for Julia data science, especially for users coming from R :grinning:

I think the major issue was that DataFrames was on such an accelerated path of improvement that any work saved by keeping DataArrays meant far greater losses in potential additions and enhancements. I’ve been following and weakly contributing for months now and, in my opinion, it was the right thing to do.

When will I be able to get it via Pkg.update()? For now I am still on 0.10.1 and I don’t get an update via Pkg.update().