DataTables or DataFrames?

Hi Natasha,

I’m one of the stats/data developers. In short, I think the answer is DataFrames.

Why DataFrames? At the moment, I think it provides superior usability and more widespread, mature support in the ecosystem, at the cost of slightly poorer performance compared to DataTables.


Now, for a bit of background. Continue at your own risk…

Some of my colleagues may disagree on this, but my take is that at this point, the future of DataFrames and DataTables is somewhat unclear. Originally the plan was to convert DataFrames from a DataArrays backend to a NullableArrays backend, as NullableArrays currently offers somewhat improved performance over DataArrays due to its type stability. NullableArrays uses the Nullable type everywhere, which ensures that the output of a given function has a predictable type. DataArrays effectively uses Union{T, NAtype} for some type T. Since Unions are not (yet!) well optimized by the compiler, the performance lags behind that of NullableArrays.

A while back, the DataFrames master branch made the switch to NullableArrays, and a release was planned. As people began to try out the master branch, it became clear that the change to Nullable was massively breaking, particularly without more widespread package support or infrastructure in place, and that DataFrames had become more frustrating to work with. An example of this comes from an issue that sparked my proposal to separate the two packages. We decided to keep the classic DataFrames alive and well going forward and to maintain its Nullable counterpart as DataTables. We considered eventually deprecating DataFrames in favor of DataTables, but I, as well as some of my colleagues, still prefer DataFrames and don’t want that to happen.

DataArrays, with its NA value, and NullableArrays, with its Nullable type, offer different mental models of data values. A Nullable is actually a container that holds either zero or one value. NA, as well as Union{T, NAtype}, is simply a scalar. One might expect, coming from R, SAS, or other statistical software, that one could use NA as a scalar and that it would propagate through arithmetic operations. Using Nullables requires thinking a little differently; currently one must use broadcast to obtain what’s called “lifting,” where an operation returns a null value when passed a null value. This way of thinking is preferred by some and disliked by others.
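
To make the two mental models concrete, here is a minimal sketch. It is written in Python rather than Julia so that it does not depend on any particular package version, and it is not the DataArrays or NullableArrays API: `NAType`, `NA`, `Box`, and `lift` are hypothetical names made up purely for illustration.

```python
import operator


class NAType:
    """Hypothetical scalar sentinel: addition involving NA yields NA."""

    def __add__(self, other):
        return self

    def __radd__(self, other):
        return self

    def __repr__(self):
        return "NA"


NA = NAType()


class Box:
    """Hypothetical container holding zero or one value (the Nullable idea)."""

    def __init__(self, *value):
        if len(value) > 1:
            raise ValueError("Box holds at most one value")
        self._value = value  # tuple of length 0 or 1

    def is_null(self):
        return len(self._value) == 0

    def get(self):
        if self.is_null():
            raise ValueError("null Box has no value")
        return self._value[0]


def lift(f):
    """Wrap f so it unwraps Box arguments and returns a null Box if any is null."""

    def lifted(*boxes):
        if any(b.is_null() for b in boxes):
            return Box()
        return Box(f(*(b.get() for b in boxes)))

    return lifted


# Scalar model: NA takes part in arithmetic directly and propagates.
scalar_result = 1 + NA

# Container model: the operation has to be lifted before it accepts boxes.
add = lift(operator.add)
boxed_sum = add(Box(1), Box(2))
boxed_null = add(Box(1), Box())
```

In the scalar model, ordinary arithmetic keeps working and the missing value simply flows through. In the container model, `Box(1) + Box(2)` is an error until the operation is explicitly lifted, which is roughly the extra step that some people like and others find frustrating.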

An enormous amount of thought and care has gone into the discussion of what to do next. This discussion has been spearheaded largely by John Myles White and other JuliaStats developers, with input from the community. We (the developers) are hoping that the next release of Julia will bring optimizations for union types, which would allow the DataArrays-style approach to missing data to be optimized, and that the relevant types and operations can be moved into Base.


Now, a word on the state of the ecosystem…

Some packages have (in my opinion, too hastily) adopted DataTables (e.g. CSV and RCall), whereas many are still set up to use DataFrames exclusively (e.g. Gadfly and GLM). We’re hoping that this can eventually be reconciled by providing a tabular data abstraction that enforces a common API that packages can code against. That would allow users to say using MyFavoriteTable and, so long as MyFavoriteTable adheres to the abstract table API, things “just work.” That’s the ultimate goal. There has been some work toward this, but we aren’t there yet.
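
As a rough illustration of what "coding against a common API" could look like, here is a small sketch, again in Python and again entirely hypothetical: `AbstractTable`, `ColumnTable`, `RowTable`, and `column_mean` are invented names, not the interface that is actually being worked on. The point is only that downstream code touches the shared interface and never the concrete storage.

```python
from abc import ABC, abstractmethod


class AbstractTable(ABC):
    """Hypothetical minimal table interface that packages could code against."""

    @abstractmethod
    def column_names(self):
        """Return the column names as a list."""

    @abstractmethod
    def column(self, name):
        """Return the values of one column as a list."""


class ColumnTable(AbstractTable):
    """Stores data column by column."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def column_names(self):
        return list(self._columns)

    def column(self, name):
        return list(self._columns[name])


class RowTable(AbstractTable):
    """Stores data as a list of row dictionaries."""

    def __init__(self, rows):
        self._rows = list(rows)

    def column_names(self):
        return list(self._rows[0]) if self._rows else []

    def column(self, name):
        return [row[name] for row in self._rows]


def column_mean(table, name):
    """Downstream code: uses only the AbstractTable methods."""
    values = table.column(name)
    return sum(values) / len(values)


by_column = ColumnTable({"x": [1, 2, 3]})
by_row = RowTable([{"x": 1}, {"x": 3}])
means = (column_mean(by_column, "x"), column_mean(by_row, "x"))
```

Here `column_mean` works unchanged with either storage layout, which is the "just works" property described above.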

In the meantime, the best choice of tabular data storage in Julia depends somewhat on your needs, but for general purposes, I’d recommend that users and package authors continue to use and support DataFrames.


I realize that’s an incredibly long-winded answer, but I hope it will prove useful to you and to anyone else sharing your concerns or curiosity.

Regards,
Alex
