UPDATE: the plan described below is going to be implemented in a different way from what was initially announced. The DataFrames package will remain the same as it is now: instead, the new framework will be provided by the new DataTables package. This will allow for a less disruptive migration of existing code relying on DataFrames.
UPDATE 2: an updated summary is available in this post.
Towards DataFrames 0.9.0
The DataFrames package and the surrounding ecosystem are currently undergoing a deep refactoring in development branches, based on a framework developed over the last two years. This work aims to dramatically improve performance by replacing the `DataArray` type (and its `NA` value representing missingness) with the new `Nullable`, `NullableArray` (see this blog post) and `CategoricalArray` types. Please refer to this blog post for an explanation of the limitations of the current design based on `DataArray`. The new framework is planned to be released as version 0.9.0 in early February 2017.
New APIs and Compatibility Breaks
Despite our efforts to preserve backward compatibility, this change will likely break some existing workflows. The standard indexing approach (inherited from R) will no longer be the recommended interface. Instead, convenient, flexible and efficient high-level APIs inspired by the dplyr R package, by SQL or by LINQ will be preferred. Users are encouraged to experiment with these approaches even with the current stable DataFrames release (0.8.x series), via the DataFramesMeta and Query packages. Eventually, an API based on the StructuredQueries package (see this blog post), which is still in development, will be provided. Among other advantages, these high-level APIs will eventually support different data sources, from in-memory data frames to out-of-core databases, with very few code changes.
The new DataFrames release will require adjustments from all packages depending on DataFrames. Until then, development will continue to happen on the `master` branch of the git repository. In many cases, both the new and the old frameworks can be supported in parallel (by supporting both `DataArray` and `NullableArray`): when possible, package authors are encouraged to start porting as soon as possible. The porting work is tracked in a GitHub issue; take inspiration from existing pull requests, and do not hesitate to ask for help there if needed.
Motivated users can also experiment with the development version, though be warned that the user experience can currently be frustrating due to incomplete support for `Nullable` in Julia and in high-level APIs. This issue, known as “lifting” (see this discussion and this one, as well as linked pages), still requires fundamental changes. We expect these to be complete by early January 2017 to allow for a progressive migration; users are not advised to upgrade to the development version for actual work until then.
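To make the “lifting” problem more concrete, here is a minimal sketch of the idea. It is an illustration only: to stay self-contained it defines a small hypothetical `Maybe` wrapper rather than relying on `Nullable` itself, and the names `Maybe`, `isnull`, `unwrap` and `lift` are assumptions made for this example, not the API of any package. It is written in current Julia syntax.

```julia
# Hypothetical stand-in for a nullable wrapper (NOT the actual Nullable type).
struct Maybe{T}
    value::Union{T, Nothing}
end

# Construct an empty ("null") Maybe with element type T.
Maybe{T}() where {T} = Maybe{T}(nothing)

isnull(x::Maybe) = x.value === nothing
unwrap(x::Maybe) = isnull(x) ? error("cannot unwrap a null value") : x.value

# Lift `f` so that it accepts Maybe arguments: the result is null if any
# argument is null, otherwise `f` is applied to the unwrapped values.
# The result element type R is passed explicitly to keep the sketch simple.
function lift(f, ::Type{R}, xs::Maybe...) where {R}
    any(isnull, xs) && return Maybe{R}()
    return Maybe{R}(f(map(unwrap, xs)...))
end

a = Maybe{Int}(2)
b = Maybe{Int}(3)
n = Maybe{Int}()

sum_ab = lift(+, Int, a, b)   # wraps the sum of the two values
sum_an = lift(+, Int, a, n)   # null, because `n` is null
```

In this sketch, calling `a + b` directly fails because no `+` method is defined for the wrapper type. Without systematic lifting, every function applied to wrapped values has to be wrapped by hand in this way, which is the kind of friction described above.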
More Changes
The above changes will be coordinated with a related refactoring of the DataFrames codebase to increase modularity:
- CSV reading and writing support (`readtable` and `writetable`) will be deprecated in favor of the CSV package. Data importation and exportation should more generally be done via the DataStreams package (see this blog post).
- Functions translating model formulas into model matrices will be moved to a separate StatsModels package, with the goal of eventually supporting any kind of `AbstractTable` (including `DataFrame`), and will also include model-related functions currently in StatsBase. Though this will not happen in the first release, in the end modeling packages should only need to depend on that package, and no longer on DataFrames.
- A new `AbstractTable` interface will be progressively developed in the eponymous package to allow writing generic code supporting any kind of tabular data, including `DataFrame`, without depending on the DataFrames package.
- Packages strongly tied to DataFrames (including that package itself) will be moved to the JuliaData organization to keep JuliaStats focused on actual statistics.
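As a rough illustration of what generic code written against a table interface could look like, here is a small sketch. Everything in it is hypothetical: the names `TabularData`, `columnnames`, `getcolumn`, `DictTable` and `colmeans` are invented for this example and are not the actual `AbstractTable` API, which is still being designed.

```julia
# Hypothetical interface: these names are illustrative only,
# not the API of the AbstractTables package.
abstract type TabularData end

# Functions that each concrete table type is expected to implement.
function columnnames end
function getcolumn end

# A toy table type backed by a Dict of column vectors.
struct DictTable <: TabularData
    columns::Dict{Symbol, Vector{Float64}}
end

columnnames(t::DictTable) = sort(collect(keys(t.columns)))
getcolumn(t::DictTable, name::Symbol) = t.columns[name]

# Generic code written only against the interface: it works for any
# subtype of TabularData that implements the two functions above.
function colmeans(t::TabularData)
    return Dict(name => sum(getcolumn(t, name)) / length(getcolumn(t, name))
                for name in columnnames(t))
end

t = DictTable(Dict(:a => [1.0, 2.0, 3.0], :b => [4.0, 6.0, 8.0]))
means = colmeans(t)
```

The point of the design is that `colmeans` never mentions `DictTable`: a package providing such a function would only need to depend on the package defining the interface, not on any particular table implementation.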
We are aware that the transition will certainly be disruptive for users. But we are confident the advantages of the new framework will greatly offset its costs, following state-of-the-art designs like R’s dplyr and Python’s Pandas 2.0, and taking full advantage of Julia’s flexibility and performance. Your help is welcome to push forward with this roadmap!