After a long and complex development period, we are glad to announce that version 0.11.0 of DataFrames has been released. Among other features listed in the release notes, the major change introduced by this version is the move from the NA value (from the DataArrays package) to the new missing value (from the Missings package, and soon in Base).
DataFrames have been completely decoupled from DataArrays: the DataFrame constructor will no longer convert columns to DataArrays, but will keep them as they are. DataFrame columns can therefore be either plain Vector{T} objects (without support for missing values), Vector{Union{T, Missing}} (supporting missing values), DataVector{T} (by creating such vectors manually), or any other AbstractVector object. Thanks to improvements in the compiler, Vector{Union{T, Missing}} uses an efficient storage similar to DataArray on Julia 0.7, and should generally behave like DataVector{T}. The latter type can still be used for optimal performance, especially on Julia 0.6 (but also on 0.7, since not all Vector{Union{T, Missing}} optimizations have been implemented yet).
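As a rough illustration of the difference between the first two column types, here is a small sketch using only Base. It assumes Julia 1.0 or later, where missing and Missing are provided by Base (on Julia 0.6, one would add using Missings first). The variable names are made up for the example.

```julia
# Assumes Julia 1.0+, where `missing` is provided by Base.

# A plain vector: its element type has no room for missing values.
plain = [1, 2, 3]

# A vector literal containing `missing` gets a Union element type.
with_missing = [1, missing, 3]

# Widen the element type of an existing vector so that it accepts missing values.
# This creates a new array; `plain` itself is left untouched.
widened = convert(Vector{Union{Int, Missing}}, plain)
widened[2] = missing

# Storing `missing` in the plain vector fails, since it cannot be converted to Int.
plain_rejects_missing = try
    plain[1] = missing
    false
catch
    true
end
```

Either kind of vector can be passed as a column to the DataFrame constructor, which keeps it as it is.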
As part of the separation of features into independent packages, PooledDataArray has been deprecated in favor of either CategoricalArray or PooledArray. Indeed, PooledDataArray suffered from a lack of clarity regarding its goals: it was at the same time a way to efficiently store data with a small number of unique values, a way to represent categorical data, and it always supported missing values. Categorical data should now be stored using CategoricalArray, which supports both nominal and ordinal variables and allows comparing elements using operators such as <. Non-categorical data with a small number of unique values should be stored using the PooledArray type. These two types can either accept missing values or not, depending on the needs.
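The distinction between the two replacement types can be sketched as follows. This is only an illustration, assuming the CategoricalArrays and PooledArrays packages are installed; the data and variable names are invented for the example.

```julia
using CategoricalArrays  # assumed to be installed
using PooledArrays       # assumed to be installed

# Ordinal variable: the levels are ordered, so elements can be compared with <.
sizes = categorical(["small", "large", "medium", "small"], ordered=true)
levels!(sizes, ["small", "medium", "large"])  # set the order explicitly

small_lt_large = sizes[1] < sizes[2]

# Missing values are accepted when the input contains them.
sizes_with_missing = categorical(["small", missing, "large"])

# Non-categorical data with few unique values: pooled storage, no level semantics.
cities = ["Paris", "Lyon", "Paris", "Paris", "Lyon"]
pooled = PooledArray(cities)
```

Here sizes carries an explicit ordering of its levels, while pooled is simply a more compact way of storing the same strings.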
Functions to import/export CSV (readtable and writetable) have been deprecated in favor of CSV.read and CSV.write from the CSV package. This allows sharing code and combining our efforts with all other packages working with data.
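A round trip through the CSV package might look like the sketch below. It is written against recent versions of CSV.jl, which expect the target type (here DataFrame) as the second argument to CSV.read; this is an assumption about the installed version, and releases contemporary with DataFrames 0.11 used a slightly different call. The column names and temporary file are made up for the example.

```julia
using CSV, DataFrames  # both assumed to be installed, in recent versions

df = DataFrame(id = [1, 2, 3], score = [1.5, missing, 3.0])

# Write to a temporary file (hypothetical path, created just for this example).
path = joinpath(mktempdir(), "example.csv")
CSV.write(path, df)

# Read it back; the empty field in the score column is parsed as `missing`.
df2 = CSV.read(path, DataFrame)
```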
Finally, modeling features have been moved to a separate StatsModels package. This difference should only be visible to authors of modeling packages, which should now use that package instead of depending on DataFrames. The objective is to allow modeling packages to support any type of data structure automatically.
The porting process should be relatively straightforward. Deprecation warnings are printed, keeping the current code working in many cases (but unfortunately not all cases). NA should be replaced with missing, NAType with Missing and isna with ismissing everywhere. Functions dispatching on DataArray or AbstractDataArray should use AbstractArray{Union{T, Missing}} or AbstractArray{>:Missing} instead, which will match (among others) DataArray{T}. The skipna=true argument should be replaced with skipmissing, e.g. sum(skipmissing(x)). PooledDataArray should be replaced with either CategoricalArray or PooledArray, which will require some adjustments to the code using such arrays. Code using modeling functions should call using StatsModels first. See the DataFrames manual for a short introduction to missing and CategoricalArray.
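For reference, ported code could look like the following sketch. It uses only Base and assumes Julia 1.0 or later (on Julia 0.6, add using Missings). The count_missing function is a hypothetical helper written for this example, not part of any package.

```julia
# Assumes Julia 1.0+, where `missing`, `ismissing` and `skipmissing` are in Base.
x = [1.5, missing, 2.5]

# Test for a missing value with ismissing.
second_is_missing = ismissing(x[2])

# Reductions skip missing values by wrapping the input in skipmissing.
total = sum(skipmissing(x))

# Hypothetical helper: dispatch on arrays whose element type allows missing...
count_missing(v::AbstractArray{>:Missing}) = count(ismissing, v)
# ...with a fallback for arrays that cannot contain missing values at all.
count_missing(v::AbstractArray) = 0

n_missing = count_missing(x)
n_missing_plain = count_missing([1, 2, 3])
```

The first method applies to any array type whose element type includes Missing, so it does not need to know which concrete array type stores the data.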
We hope that this new, more modular framework will allow for a better interaction between all packages in the data ecosystem. It should pave the road for future improvements to DataFrames and related packages. However, updating all packages to the new framework will take time. A list tracking progress is available here. Your help is welcome! Please also report any bugs you may find.
Also note that until all packages on your local installation have been ported to DataFrames 0.11.0, they will keep requiring version 0.10.1, and the package manager will not update DataFrames to version 0.11.0. If removing the problematic dependencies is not an option, you can use a separate Julia package directory to test the new framework: just set the JULIA_PKGDIR environment variable before starting Julia, and run Pkg.add("DataFrames").