Announcement: DataFrames Future Plans


#1

UPDATE: the plan described below is going to be implemented in a different way from what was initially announced. The DataFrames package will remain the same as it is now: instead, the new framework will be provided by the new DataTables package. This will allow for a less disruptive migration of existing code relying on DataFrames.

UPDATE 2: an updated summary is available in this post.

Towards DataFrames 0.9.0

The DataFrames package and the surrounding ecosystem are currently undergoing a deep refactoring in development branches, based on a framework developed over the last two years. This work aims to dramatically improve performance by replacing the DataArray type (and its NA value representing missingness) with the new Nullable, NullableArray (see this blog post) and CategoricalArray types. Please refer to this blog post for an explanation of the limitations of the current design based on DataArray. The new framework is planned to be released as version 0.9.0 in early February 2017.
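As a rough sketch of the difference, the constructors below follow the DataArrays, NullableArrays and CategoricalArrays APIs of the time (treat the exact details as assumptions):

```julia
# Current framework: DataArray, with NA marking missing entries
using DataArrays
da = @data([1, NA, 3])

# New framework: NullableArray stores a plain values vector plus a
# Boolean mask of nulls, so the element type stays concrete instead
# of being widened to accommodate NA.
using NullableArrays
na = NullableArray([1, 2, 3], [false, true, false])  # second entry is null

# CategoricalArray replaces PooledDataArray for categorical data.
using CategoricalArrays
ca = CategoricalArray(["a", "b", "a"])               # levels stored once
```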

New APIs and Compatibility Breaks

Despite our efforts to preserve backward compatibility, this change will likely break some existing workflows. The standard indexing approach (inherited from R) will no longer be the recommended interface. Instead, convenient, flexible and efficient high-level APIs inspired by the dplyr R package, by SQL or by LINQ will be preferred. Users are encouraged to experiment with these approaches even with the current stable DataFrames release (0.8.x series), via the DataFramesMeta and Query packages. Eventually, an API based on the StructuredQueries package (see this blog post), which is still in development, will be provided. Among other advantages, these high-level APIs will eventually support different data sources, from in-memory data frames to out-of-core databases, with very few code changes.
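For a taste of what such a query looks like, here is a minimal sketch using Query.jl's LINQ-style macros (the column names and data are made up for illustration):

```julia
using DataFrames, Query

df = DataFrame(name = ["Ann", "Bob", "Cal"], age = [23, 41, 35])

# Filter rows and select columns without touching the low-level
# indexing API inherited from R.
result = @from r in df begin
    @where r.age > 30
    @select {r.name, r.age}
    @collect DataFrame
end
```

The same query syntax is designed to run against other data sources supported by Query.jl, which is what makes these high-level APIs largely backend-agnostic.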

The new DataFrames release will require adjustments from all packages depending on DataFrames. Until then, development will continue to happen on the master branch of the git repository. In many cases, both the new and the old frameworks can be supported in parallel (by supporting both DataArray and NullableArray): where that is possible, package authors are encouraged to start porting now. The porting work is tracked in a GitHub issue; take inspiration from existing pull requests, and do not hesitate to ask for help there if needed.

Motivated users can also experiment with the development version, though be warned that the user experience can currently be frustrating due to incomplete support for Nullable in Julia and in high-level APIs. This issue, known as “lifting” (see this discussion and this one, as well as linked pages), still requires fundamental changes. We expect these to be complete by early January 2017 to allow for a progressive migration; until then, we advise users not to upgrade to the development version for real work.

More Changes

The above changes will be coordinated with a related refactoring of the DataFrames codebase to increase modularity:

  • CSV reading and writing support (readtable and writetable) will be deprecated in favor of the CSV package. Data importation and exportation should more generally be done via the DataStreams package (see this blog post).
  • Functions translating model formulas into model matrices will be moved to a separate StatsModels package, with the goal of eventually supporting any kind of AbstractTable (including DataFrame), and will also include model-related functions currently in StatsBase. Though this will not happen in the first release, in the end modeling packages should only need to depend on that package, and no longer on DataFrames.
  • A new AbstractTable interface will be progressively developed in the eponymous package to allow writing generic code supporting any kind of tabular data, including DataFrame, without depending on the DataFrames package.
  • Packages strongly tied to DataFrames (including that package itself) will be moved to the JuliaData organization to keep JuliaStats focused on actual statistics.

We are aware that the transition will certainly be disruptive for users. But we are confident the advantages of the new framework will greatly offset its costs, following state-of-the-art designs like R’s dplyr and Python’s Pandas 2.0, and taking full advantage of Julia’s flexibility and performance. Your help is welcome to push forward with this roadmap!


#2

#3

Perhaps you could give it a different name. Then people could transition in their own time…


#4

Fortunately it’s possible to stick with the older version of a package if you need to. For package developers just put an upper-bound on DataFrames in your REQUIRE file. As a user you can use Pkg.pin to prevent Julia from updating the package until you’re ready.
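Concretely, that might look like this (using the REQUIRE format and Pkg API of current Julia releases; the version number below is illustrative):

```julia
# In a package's REQUIRE file, an upper bound restricts it to 0.8.x:
#   DataFrames 0.8 0.9-

# As a user, pin the installed version so Pkg.update() leaves it
# alone until you are ready to migrate:
Pkg.pin("DataFrames", v"0.8.5")   # version number is illustrative

# Later, when ready to move to the new framework:
Pkg.free("DataFrames")
Pkg.update()
```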


#5

I actually agree with @cormullion that a new package name might be (or have been?) better. Perhaps even DataFrames2? :wink:


#6

I don’t know. DataFrames is a good name for this. People know the name and will search for it. Deprecating the name in order to help people with an API change pre-1.0 seems like prematurely trying to enforce stability.


#7

Using a different package name wouldn’t help much, since DataFrames is a dependency of many other packages which will need to choose which version to use (until we have optional dependencies at least). It would be really confusing to have a DataFrame object which isn’t supported by e.g. Gadfly because it expects a DataFrame type from the other DataFrames package.

On the contrary, pinning the package to the 0.8 version will ensure all dependencies are compatible (if we add upper bounds correctly) until you are ready to make the switch.


#8

Okay, fair enough. Cheers!


#9

I think this is worth posting as a blog post on the Julia blog.


#10

Not an R user — what is this referring to? df[df[:something] .== 5]?


#11

Great news,

I am glad to see progress here, and that it is being communicated like this. Is there a list of issues that need to get done before the release?

I found Milestone 0.9.0 and Issue #1092. Are these the complete lists? Maybe this is a good use for the new GitHub Projects feature.

-James


#12

Not the best place to ask; please look at the docs or start a new thread.

[quote=“jpfairbanks, post:11, topic:266”]
I found Milestone 0.9.0 and Issue #1092. Are these the complete lists? Maybe this is a good use for the new Github Projects feature.
[/quote]

Yes, that’s more or less complete. I don’t think we need to use the Projects feature, as most of the work needs to happen outside of DataFrames now (in particular, in NullableArrays, StructuredQueries and Julia Base).


#13

I’m not sure, as blog posts usually have announced work when it was more or less completed. I’d prefer to wait until we actually release the new framework, which is when we should draw the most attention to it.


#14

What are the advantages of DataTables.jl compared with DataFrames.jl? From the perspective of performance, which package should I choose? Thanks!


#15

The main advantage of DataTables is type stability (at the column level), but whether it will make your code faster will depend on many things, so it’s hard to tell without trying. Also, DataTables hasn’t really been optimized yet, though we’re working on it (e.g. this PR).

If you use high-level APIs like Query.jl, it should be easy to switch from one framework to the other to compare them. If you work directly with column vectors, I strongly recommend using Julia 0.6, where lifting of missing values is supported by element-wise operators like .+.
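As a rough sketch of what that lifting means in practice (assuming Julia 0.6’s lifted broadcasting over Nullable, as described in the discussions linked above):

```julia
# On Julia 0.6, dotted (element-wise) operators lift over Nullable:
# a missing input propagates to a missing output instead of erroring.
a = Nullable(2)
b = Nullable(3)
c = Nullable{Int}()   # a missing value

a .+ b                # a Nullable wrapping 5
a .+ c                # an empty Nullable{Int}: the null propagates
```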


#16

Where is the documentation for DataTables? The links at https://github.com/JuliaData/DataTables.jl result in a 404 (page not found) error.


#17

I wouldn’t just dismiss the smart indexing approach for querying a data table. I think there’s merit in supporting both the indexing & the LINQ/dplyr approach. In R when working with in-memory tables I prefer the indexing approach of data.table both for speed & conciseness. See http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly for a discussion, especially the second reply in favor of data.table.


#18

@sylvaticus The manual didn’t build correctly until recently because Query.jl needed to be updated. Now there remains a small bug that this PR should fix.

@Steven_Sagaert I didn’t say we dismissed that syntax, but currently it requires some changes when moving between DataFrames and DataTables, so that’s not the best way to compare them. Also the syntax to work with Nullable isn’t stabilized yet, so it’s easier to work with high-level APIs for now. Finally, performance shouldn’t be higher with direct indexing, at least compared with optimized querying frameworks. (BTW, DataFrames/DataTables does not support data.table-like advanced syntax, only the basic indexing API.)


#19

You can get some information from the man directory.


#20

Another package to consider is IndexedTables, particularly if you like R’s data.table (I do!).

It’s type stable, and the indexing means lookups and joins should be fast.

Note that this package is not registered yet.