DataFrames 0.11 released

nalimilan · November 24, 2017, 8:30pm

After a long and complex development period, we are glad to announce that version 0.11.0 of DataFrames has been released. Among other features listed in the release notes, the major change introduced by this version is the move from the NA value (from the DataArrays package) to the new missing value (from the Missings package, and soon in Base).

DataFrames has been completely decoupled from DataArrays: the DataFrame constructor no longer converts columns to DataArrays, but keeps them as they are. DataFrame columns can therefore be either plain Vector{T} objects (without support for missing values), Vector{Union{T, Missing}} (supporting missing values), DataVector{T} (by creating such vectors manually), or any other AbstractVector object. Thanks to improvements in the compiler, on Julia 0.7 Vector{Union{T, Missing}} uses an efficient storage layout similar to that of DataArray, and should generally behave like DataVector{T}. The latter type can still be used for optimal performance, especially on Julia 0.6 (but also on 0.7, since not all Vector{Union{T, Missing}} optimizations have been implemented yet).
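As a rough illustration of the point about columns being kept as they are, here is a minimal sketch. It assumes a current DataFrames release and Julia 1.x, where columns can be accessed as df.id (with 0.11 on Julia 0.6 the equivalent access was df[:id]); the column names are made up for the example.

```julia
using DataFrames

# A plain vector (no missing values allowed) and a vector that allows them.
plain  = [1, 2, 3]              # Vector{Int}
sparse = [1.5, missing, 3.0]    # Vector{Union{Missing, Float64}}

# The constructor keeps each column's element type as given;
# nothing is converted to a DataArray.
df = DataFrame(id = plain, value = sparse)

id_type    = eltype(df.id)      # Int
value_type = eltype(df.value)   # Union{Missing, Float64}
```

Only the value column can hold missing; assigning missing into the id column would raise an error because its element type is Int.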

As part of the separation of features into independent packages, PooledDataArray has been deprecated in favor of either CategoricalArray or PooledArray. Indeed, PooledDataArray suffered from a lack of clarity regarding its goals: it was at the same time a way to efficiently store data with a small number of unique values, a way to represent categorical data, and it always supported missing values. Categorical data should now be stored using CategoricalArray, which supports both nominal and ordinal variables and allows comparing elements using operators such as <. Non-categorical data with a small number of unique values should be stored using the PooledArray type. These two types can either accept missing values or not, depending on the needs.
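To make the distinction concrete, here is a small sketch, assuming current versions of the CategoricalArrays and PooledArrays packages; the data and variable names are invented for illustration.

```julia
using CategoricalArrays, PooledArrays

# Ordinal categorical data: levels carry an order, so `<` is meaningful.
sizes = categorical(["small", "large", "medium", "small"], ordered=true)
levels!(sizes, ["small", "medium", "large"])   # set the intended order
small_lt_large = sizes[1] < sizes[2]           # "small" < "large"

# Categorical data that also accepts missing values.
answers = categorical(["yes", missing, "no"])

# Non-categorical data with few unique values: pooled storage only.
cities = PooledArray(["Paris", "Lyon", "Paris", "Paris"])
```

The PooledArray behaves like an ordinary array of strings; it only changes how the values are stored, whereas the CategoricalArray also attaches levels and (optionally) an order to them.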

Functions to import/export CSV (readtable and writetable) have been deprecated in favor of CSV.read and CSV.write from the CSV package. This allows sharing code and combining our efforts with all other packages working with data.
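A minimal round-trip sketch of the replacement for writetable and readtable follows. It is written against current CSV.jl, where CSV.read takes the target type as a second argument; the calling convention at the time of this release may have differed, so treat the exact signature as an assumption. The file path and column names are illustrative.

```julia
using CSV, DataFrames

df = DataFrame(name = ["a", "b"], score = [1.0, missing])

# Instead of writetable(path, df):
path = tempname() * ".csv"
CSV.write(path, df)

# Instead of readtable(path):
df2 = CSV.read(path, DataFrame)

n_rows = size(df2, 1)
```

The missing score is written as an empty field and read back as missing.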

Finally, modeling features have been moved to a separate StatsModels package. This difference should only be visible to authors of modeling packages, who should now use that package instead of depending on DataFrames. The objective is to allow modeling packages to support any type of data structure automatically.

The porting process should be relatively straightforward. Deprecation warnings are printed, keeping the current code working in many cases (but unfortunately not all). NA should be replaced with missing, NAType with Missing and isna with ismissing everywhere. Functions dispatching on DataArray or AbstractDataArray should use AbstractArray{Union{T, Missing}} or AbstractArray{>:Missing} instead, which will match (among others) DataArray{T}. The skipna=true argument should be replaced with skipmissing, e.g. sum(skipmissing(x)). PooledDataArray should be replaced with either CategoricalArray or PooledArray, which will require some adjustments to the code using such arrays. Code using modeling functions should call using StatsModels first. See the DataFrames manual for a short introduction to missing and CategoricalArray.
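The replacements listed above can be sketched as follows. This assumes Julia 0.7 or later, where missing lives in Base (on Julia 0.6, add using Missings first); count_missing is a hypothetical helper written only for this example.

```julia
# Assumes Julia 0.7+ (on 0.6, run `using Missings` first).

x = [1, missing, 3]                 # Vector{Union{Missing, Int}}

# isna(x[2]) becomes:
second_is_missing = ismissing(x[2])

# The old keyword for skipping NA values in reductions becomes:
total = sum(skipmissing(x))

# Dispatching on "arrays that may contain missing values" instead of
# on DataArray / AbstractDataArray (hypothetical helper):
count_missing(v::AbstractArray{>:Missing}) = count(ismissing, v)
count_missing(v::AbstractArray) = 0   # element type cannot hold missing

n_missing_x     = count_missing(x)          # first method
n_missing_plain = count_missing([1, 2, 3])  # fallback method
```

The AbstractArray{>:Missing} signature matches any array whose element type includes Missing, so the same method covers Vector{Union{T, Missing}} and other array types that allow missing values.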

We hope that this new, more modular framework will allow for a better interaction between all packages in the data ecosystem. It should pave the road for future improvements to DataFrames and related packages. However, updating all packages to the new framework will take time. A list tracking progress is available here. Your help is welcome! Please also report any bugs you may find.

Also note that until all packages in your local installation have been ported to DataFrames 0.11.0, they will keep requiring version 0.10.1, and the package manager will not update DataFrames to version 0.11.0. If removing the problematic dependencies is not an option, you can use a separate Julia package directory to test the new framework: just set the JULIA_PKGDIR environment variable before starting Julia, and run Pkg.add("DataFrames").

Datseris · November 24, 2017, 9:12pm

Hi! I haven’t read the post yet, but before that I’d like to say that it may be worth doing a small post in the “Community” section of the forum. For example, I (and maybe other people) am subscribed there for “announcements” and maybe not to Domains/Data. I am thankful I was tagged by Chris on gitter, otherwise I would never have known about this…

randyzwitch · November 24, 2017, 9:13pm

Fantastic work by so many people! Can’t wait to try this out and port over my packages.

nalimilan · November 24, 2017, 9:32pm

I’ve pinned the post globally for one week, that should be enough for anybody interested to notice it.

mkborregaard · November 24, 2017, 9:46pm

Thank you so much @nalimilan. The work you and the rest of the Data people (sensu lato) are doing means so much for this community!

ChrisRackauckas · November 24, 2017, 11:26pm

This looks great!

Does anyone have a recap of what the data space looks like now? It seems like the new DataFrames now should be solid going forward. How does it compare to IndexedTables.jl or Pandas.jl both in performance and features? I’m sure someone has been benchmarking master but I can’t find a good thread on it.

yakir12 · November 24, 2017, 11:53pm

Christmas did come early this year!

Yifan_Liu · November 25, 2017, 12:59am

I guess the Julia data frame is similar to the R data frame, but now R has a new data format called tibble, and I really would like to see the Julia version of tibble.

juliohm · November 25, 2017, 2:33am

@Yifan_Liu can you comment on the advantages of tibble compared to dataframes?

Thank you @nalimilan for the great work.

xiaodai · November 25, 2017, 3:02am

@Yifan_Liu tibble is just another form of data.frame. It is actually just a data.frame in the backend, but it defines methods on top of data.frame so that it prints more nicely and has more sensible defaults. Put simply, tibble is just a more sensible interface on top of data.frame. DataFrames.jl isn’t built into Julia the way data.frame is built into R. So DataFrames.jl’s equivalent in the R world is data.frame plus some packages on top of it. E.g. if all the functionality in tibble were implemented in DataFrames.jl, then DataFrames.jl would become the equivalent of data.frame + tibble.

So it would be more useful to pinpoint the specific features of an R package that you see as beneficial and that don’t already exist in DataFrames.jl.

Liso · November 25, 2017, 2:13pm

Does that mean that DataArrays is starting to be abandoned?

(I’m not sure, because it would need similar changes from NA/Null to Missing and from PooledDataArray to CategoricalArray, etc…)

Couldn’t DataArray be changed to an alias for AbstractArray{Union{T, Missing}} or something similar, to avoid duplicated effort?

nalimilan · November 25, 2017, 2:16pm

No, DataArrays continues to work as it did before (it was even cleaned up a bit in the new release). Only PooledDataArray is deprecated.

DataArray should not be turned into an alias for AbstractArray{Union{T, Missing}}: if we did that, we would essentially remove DataArray. We hope that at some point Array{Union{T, Missing}} will be as efficient as DataArray{T} so that the latter won’t be needed anymore, but for now it is still useful for performance.

ValdarT · November 26, 2017, 11:17am

This is fantastic, thank you!

Could anyone in the know tell me what the status of IterableTables.jl / Query.jl support is? (I’m aware of the PRs, but they have been quiet for a while now.)

And a recap as asked by @ChrisRackauckas would indeed be very welcome.

StefanKarpinski · November 28, 2017, 5:40pm

4 posts were split to a new topic: Compatibility of Query and Union{T, Missing}

vchuravy · November 29, 2017, 2:39pm

A post was split to a new topic: Interoperability between R and Julia

Sabodhapati · November 28, 2017, 3:55pm

@nalimilan Fantastic work, thank you very much! It’s fundamental and important for Julia data science, especially for users coming from R.

Nectarineimp · November 29, 2017, 6:15pm

I think the major issue was that DataFrames was on such an accelerated path of improvement that any work saved by keeping DataArrays meant far greater losses in potential additions and enhancements. I’ve been following and weakly contributing for months now and, in my opinion, it was the right thing to do.

IljaK91 · December 7, 2017, 11:09am

When will I be able to get it via Pkg.update()? For now I am still on 0.10.1 and I don’t get an update via Pkg.update().
