Announcement: DataFrames Future Plans

UPDATE: the plan described below is going to be implemented in a different way from what was initially announced. The DataFrames package will remain the same as it is now: instead, the new framework will be provided by the new DataTables package. This will allow for a less disruptive migration of existing code relying on DataFrames.

UPDATE 2: an updated summary is available in this post.

Towards DataFrames 0.9.0

The DataFrames package and the surrounding ecosystem are currently undergoing a deep refactoring in development branches, based on a framework developed over the last two years. This work aims to dramatically improve performance by replacing the DataArray type (and its NA value representing missingness) with the new Nullable, NullableArray (see this blog post) and CategoricalArray types. Please refer to this blog post for an explanation of the limitations of the current design based on DataArray. The new framework is planned to be released as version 0.9.0 in early February 2017.
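
Below is a minimal sketch of what the new column types look like, assuming the NullableArrays and CategoricalArrays packages as currently published; constructors and printed output may differ slightly between releases.

```julia
using NullableArrays, CategoricalArrays

# A NullableArray stores the values together with a separate Bool mask marking
# missing entries (true = missing), instead of DataArray's NA sentinel.
x = NullableArray([1, 2, 3], [false, true, false])  # second entry is missing

# A CategoricalArray stores repeated labels as integer codes into a level pool,
# which is compact and makes grouping operations cheap.
c = CategoricalArray(["low", "high", "low"])
```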

New APIs and Compatibility Breaks

Despite our efforts to preserve backward compatibility, this change will likely break some existing workflows. The standard indexing approach (inherited from R) will no longer be the recommended interface. Instead, convenient, flexible and efficient high-level APIs inspired by the dplyr R package, by SQL or by LINQ will be preferred. Users are encouraged to experiment with these approaches even with the current stable DataFrames release (0.8.x series), via the DataFramesMeta and Query packages. Eventually, an API based on the StructuredQueries package (see this blog post), which is still in development, will be provided. Among other advantages, these high-level APIs will eventually support different data sources, from in-memory data frames to out-of-core databases, with very few code changes.
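
As an illustration of what these high-level APIs look like on the current 0.8.x series (column names and data here are invented, following the packages' documented usage), both let you express a filter-and-select without indexing directly into columns:

```julia
using DataFrames, DataFramesMeta, Query

df = DataFrame(name = ["John", "Sally", "Kirk"], age = [23, 42, 59])

# dplyr-inspired macros from DataFramesMeta
adults = @where(df, :age .> 30)

# LINQ-inspired query from Query.jl
res = @from r in df begin
    @where r.age > 30
    @select {r.name}
    @collect DataFrame
end
```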

The new DataFrames release will require adjustments from all packages depending on DataFrames. Until then, development will continue to happen on the master branch of the git repository. In many cases, both the new and the old frameworks can be supported in parallel (by accepting both DataArray and NullableArray), so package authors are encouraged to start porting as soon as possible. The porting work is tracked in a GitHub issue; take inspiration from existing pull requests, and do not hesitate to ask for help there if needed.

Motivated users can also experiment with the development version, though be warned that the user experience can currently be frustrating due to incomplete support for Nullable in Julia and in high-level APIs. This issue, known as “lifting” (see this discussion and this one, as well as linked pages), still requires fundamental changes. We expect these to be complete by early January 2017 to allow for a progressive migration; users are not advised to upgrade to the development version for actual work until then.
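
For the curious, here is a small sketch of why the missing lifting hurts when working with Nullable scalars directly on the current stable Julia (0.5): ordinary operators are not defined for Nullable there, so missingness has to be handled by hand with isnull and get (the helper below is purely illustrative).

```julia
x = Nullable(2)
y = Nullable{Int}()            # a missing value

# x + 1 throws a MethodError on Julia 0.5; each operation has to be "lifted"
# manually so that a missing input propagates to a missing result.
inc(v::Nullable{Int}) = isnull(v) ? Nullable{Int}() : Nullable(get(v) + 1)

inc(x)   # Nullable(3)
inc(y)   # still missing
```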

More Changes

The above changes will be coordinated with a related refactoring of the DataFrames codebase to increase modularity and split functionality into dedicated packages:

  • CSV reading and writing support (readtable and writetable) will be deprecated in favor of the CSV package (see the sketch after this list). Data import and export should more generally be done via the DataStreams package (see this blog post).
  • Functions translating model formulas into model matrices will be moved to a separate StatsModels package, with the goal of eventually supporting any kind of AbstractTable (including DataFrame), and will also include model-related functions currently in StatsBase. Though this will not happen in the first release, in the end modeling packages should only need to depend on that package, and no longer on DataFrames.
  • A new AbstractTable interface will be progressively developed in the eponymous package to allow writing generic code supporting any kind of tabular data, including DataFrame, without depending on the DataFrames package.
  • Packages strongly tied to DataFrames (including that package itself) will be moved to the JuliaData organization to keep JuliaStats focused on actual statistics.
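
As a rough sketch of the CSV migration mentioned in the first bullet (file names are illustrative, and CSV.jl's default sink type has varied between releases):

```julia
using DataFrames, CSV

# Current API, to be deprecated:
df = readtable("data.csv")
writetable("out.csv", df)

# CSV.jl replacement, built on DataStreams:
df = CSV.read("data.csv")
CSV.write("out.csv", df)
```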

We are aware that the transition will certainly be disruptive for users. But we are confident the advantages of the new framework will greatly offset its costs, following state-of-the-art designs like R’s dplyr and Python’s Pandas 2.0, and taking full advantage of Julia’s flexibility and performance. Your help is welcome to push forward with this roadmap!

14 Likes

Perhaps you could give it a different name. Then people could transition in their own time…

Fortunately it’s possible to stick with the older version of a package if you need to. For package developers, just put an upper bound on DataFrames in your REQUIRE file. As a user, you can use Pkg.pin to prevent Julia from updating the package until you’re ready, as sketched below.
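
For example (version numbers are illustrative), a package’s REQUIRE file can bound DataFrames below 0.9, and a user can pin the installed version from the REPL:

```julia
# In a package's REQUIRE file, an upper bound keeps the old framework:
#     DataFrames 0.8 0.9-

# As a user, pin the currently installed DataFrames until you are ready:
Pkg.pin("DataFrames", v"0.8.5")

# ...and later remove the pin to allow upgrading again:
Pkg.free("DataFrames")
```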

4 Likes

I actually agree with @cormullion that a new package name might be (or have been?) better. Perhaps even DataFrames2? :wink:

I don’t know. DataFrames is a good name for this. People know the name and will search for it. Deprecating the name in order to help people with an API change pre-1.0 seems like prematurely trying to enforce stability.

1 Like

Using a different package name wouldn’t help much, since DataFrames is a dependency of many other packages which will need to choose which version to use (until we have optional dependencies at least). It would be really confusing to have a DataFrame object which isn’t supported by e.g. Gadfly because it expects a DataFrame type from the other DataFrames package.

On the contrary, pinning the package to the 0.8 version will ensure all dependencies are compatible (if we add upper bounds correctly) until you are ready to make the switch.

1 Like

Okay, fair enough. Cheers!

I think this is worth posting as a blog post on the Julia blog.

Not an R user — what is this referring to? df[df[:something] .== 5] ?

Great news,

I am glad to see progress here, and that it is being communicated like this. Is there a list of issues that need to get done before the release?

I found Milestone 0.9.0 and Issue #1092. Are these the complete lists? Maybe this is a good use for the new Github Projects feature.

-James

Not the best place to ask; please look at the docs or start a new thread.

[quote=“jpfairbanks, post:11, topic:266”]
I found Milestone 0.9.0 and Issue #1092. Are these the complete lists? Maybe this is a good use for the new Github Projects feature.
[/quote]

Yes, that’s more or less complete. I don’t think we need to use the Projects feature, as most of the work needs to happen outside of DataFrames now (in particular, in NullableArrays, StructuredQueries and Julia Base).

I’m not sure, as blog posts have usually announced work once it was more or less complete. I’d prefer to wait until we actually release the new framework, which is when we should draw the most attention to it.

5 Likes

What are the advantages of DataTables.jl compared with DataFrames.jl? From the perspective of performance, which package should I choose? Thanks!

The main advantage of DataTables is type stability (at the column level), but whether it will make your code faster depends on many things, so it’s hard to tell without trying. Also, DataTables hasn’t really been optimized yet, though we’re working on it (e.g. this PR).

If you use high-level APIs like Query.jl, it should be easy to switch from one framework to the other to compare them. If you work directly with column vectors, I strongly recommend using Julia 0.6, where lifting over missing values is supported by element-wise operators like .+.
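
A minimal sketch of that kind of element-wise lifting, assuming NullableArray columns on Julia 0.6 (printed element types may differ between versions):

```julia
using NullableArrays

x = NullableArray([1, 2, 3], [false, true, false])   # second entry missing
y = NullableArray([10, 20, 30])

# On Julia 0.6, dot operators broadcast and lift over missing values, so the
# result is missing wherever an input was missing.
x .+ y
```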

Where is the documentation for DataTables? The links at https://github.com/JuliaData/DataTables.jl result in a 404 (page not found) error.

I wouldn’t just dismiss the smart indexing approach for querying a data table. I think there’s merit in supporting both the indexing & the LINQ/dplyr approach. In R when working with in-memory tables I prefer the indexing approach of data.table both for speed & conciseness. See r - data.table vs dplyr: can one do something well the other can't or does poorly? - Stack Overflow for a discussion, especially the second reply in favor of data.table.

1 Like

@sylvaticus The manual didn’t build correctly until recently because Query.jl needed to be updated. Now there remains a small bug that this PR should fix.

@Steven_Sagaert I didn’t say we dismissed that syntax, but currently it requires some changes when moving between DataFrames and DataTables, so it’s not the best way to compare them. Also, the syntax to work with Nullable isn’t stabilized yet, so it’s easier to work with high-level APIs for now. Finally, performance shouldn’t be higher with direct indexing, at least compared with optimized querying frameworks. (BTW, DataFrames/DataTables does not support data.table-like advanced syntax, only the basic indexing API.)

You can get some information from the man directory.
https://github.com/JuliaData/DataTables.jl/blob/master/docs/src/man/getting_started.md

Another package to consider is IndexedTables, particularly if you like R’s data.table (I do!).

It’s type stable, and the indexing means lookups and joins should be fast.

Note that this package is not registered yet.

4 Likes