After a long and complex development period, we are glad to announce that version 0.11.0 of DataFrames has been released. Among other features listed in the release notes, the major change introduced by this version is the move from the NA value (from the DataArrays package) to the new missing value (from the Missings package, and soon in Base).
DataFrames has been completely decoupled from DataArrays: the DataFrame constructor no longer converts columns to DataArrays, but keeps them as they are. DataFrame columns can therefore be either plain Vector{T} objects (without support for missing values), Vector{Union{T, Missing}} (supporting missing values), DataVector{T} (by creating such vectors manually), or any other AbstractVector object. Thanks to improvements in the compiler, on Julia 0.7 Vector{Union{T, Missing}} uses efficient storage similar to that of DataArray, and should generally behave like DataVector{T}. The latter type can still be used for optimal performance, especially on Julia 0.6 (but also on 0.7, since not all Vector{Union{T, Missing}} optimizations have been implemented yet).
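To make the distinction between these column types concrete, here is a minimal sketch using only vectors, which can then be passed to the DataFrame constructor unchanged. It assumes a Julia version where missing is available in Base (on Julia 0.6, using Missings is needed first). The helper allows_missing is a hypothetical name introduced for illustration; it is not part of DataFrames.

```julia
# On Julia 0.6, run `using Missings` first; later versions ship `missing` in Base.

# A plain vector: its element type has no room for missing values.
plain = [1, 2, 3]

# A vector literal containing `missing` gets an element type that includes Missing.
with_missing = [1, missing, 3]

# Hypothetical helper (not a DataFrames function): can this column store `missing`?
allows_missing(col::AbstractVector) = Missing <: eltype(col)

println(allows_missing(plain))         # false
println(allows_missing(with_missing))  # true

# Widen a plain vector so that missing values can be stored in it later.
widened = convert(Vector{Union{Int, Missing}}, plain)
widened[2] = missing
println(allows_missing(widened))       # true
```

Since columns are kept as they are, choosing between plain and with_missing when building a data frame determines whether that column can later receive missing values.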
As part of the separation of features into independent packages, PooledDataArray has been deprecated in favor of either CategoricalArray or PooledArray. Indeed, PooledDataArray suffered from a lack of clarity regarding its goals: it was at the same time a way to efficiently store data with a small number of unique values, a way to represent categorical data, and it always supported missing values. Categorical data should now be stored using CategoricalArray, which supports both nominal and ordinal variables and allows comparing elements using operators such as <. Non-categorical data with a small number of unique values should be stored using the PooledArray type. These two types can either accept missing values or not, depending on the needs.
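As a rough illustration of the split between the two types, the sketch below builds an ordinal variable with CategoricalArrays and a pooled string vector with PooledArrays. It is written against recent releases of both packages and assumes they are installed; the variable names and data are made up for the example.

```julia
using CategoricalArrays
using PooledArrays

# Ordinal variable: mark the array as ordered, then set the level order explicitly
# (by default levels are sorted, which would be alphabetical here).
sizes = categorical(["small", "large", "medium", "small"], ordered=true)
levels!(sizes, ["small", "medium", "large"])

# Elements of an ordered categorical array compare according to the level order.
println(sizes[1] < sizes[2])   # "small" < "large"

# Categorical data that also accepts missing values.
answers = categorical(["yes", missing, "no"])
println(ismissing(answers[2]))

# Non-categorical data with few unique values: each value is stored once in a pool.
cities = PooledArray(["Paris", "Lyon", "Paris", "Paris"])
println(cities[3])
```

The pooled vector behaves like an ordinary vector of strings, while the categorical one carries levels and an ordering, which is the distinction PooledDataArray used to blur.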
Functions to import/export CSV (readtable and writetable) have been deprecated in favor of CSV.read and CSV.write from the CSV package. This allows sharing code and combining our efforts with all other packages working with data.
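A minimal round trip with the replacement functions might look like the sketch below. It assumes recent releases of CSV and DataFrames, where CSV.read takes the target type as its second argument; releases contemporary with this announcement may accept slightly different calls. The file location and column names are made up for the example.

```julia
using CSV
using DataFrames

# A small data frame with one column that contains a missing value.
df = DataFrame(id = [1, 2, 3], score = [0.5, missing, 1.5])

# Hypothetical output location in a fresh temporary directory.
path = joinpath(mktempdir(), "example.csv")

# Replaces writetable: missing values are written as empty fields by default.
CSV.write(path, df)

# Replaces readtable: empty fields are read back as missing by default.
df2 = CSV.read(path, DataFrame)
println(size(df2))
```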
Finally, modeling features have been moved to a separate StatsModels package. This change should only be visible to authors of modeling packages, who should now use that package instead of depending on DataFrames. The objective is to allow modeling packages to support any type of data structure automatically.
The porting process should be relatively straightforward. Deprecation warnings are printed, keeping the current code working in many cases (but unfortunately not all). NA should be replaced with missing, NAType with Missing, and isna with ismissing everywhere. Functions dispatching on DataArray or AbstractDataArray should use AbstractArray{Union{T, Missing}} or AbstractArray{>:Missing} instead, which will match (among others) DataArray{T}. The skipna=true argument should be replaced with skipmissing, e.g. sum(skipmissing(x)). PooledDataArray should be replaced with either CategoricalArray or PooledArray, which will require some adjustments to the code using such arrays. Code using modeling functions should call using StatsModels first. See the DataFrames manual for a short introduction to missing and CategoricalArray.
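The replacements listed above can be sketched as follows, using only functions available in Base on Julia versions where missing has been integrated (on Julia 0.6, using Missings is needed first). The function count_missing is a hypothetical example of a method that previously would have dispatched on AbstractDataArray.

```julia
# On Julia 0.6, run `using Missings` first.

x = [1.0, missing, 3.0]

# `ismissing` takes the place of `isna`.
println(ismissing(x[2]))

# `skipmissing` takes the place of the NA-skipping keyword argument of reductions.
total = sum(skipmissing(x))
println(total)

# `skipmissing` returns a lazy iterator; call `collect` when a vector is needed.
present = collect(skipmissing(x))

# Hypothetical function: dispatch on element types that allow missing values,
# with a fallback for arrays that cannot contain any.
count_missing(v::AbstractArray{>:Missing}) = count(ismissing, v)
count_missing(v::AbstractArray) = 0

println(count_missing(x))
println(count_missing([1.0, 2.0]))
```

Note that AbstractArray{>:Missing} also matches arrays with element type Any, since Any is a supertype of Missing; the method above still returns the right count in that case.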
We hope that this new, more modular framework will allow for better interaction between all packages in the data ecosystem. It should pave the way for future improvements to DataFrames and related packages. However, updating all packages to the new framework will take time. A list tracking progress is available here. Your help is welcome! Please also report any bugs you may find.
Also note that until all packages in your local installation have been ported to DataFrames 0.11.0, they will keep requiring version 0.10.1, and the package manager will not update DataFrames to version 0.11.0. If removing the problematic dependencies is not an option, you can use a separate Julia package directory to test the new framework: just set the JULIA_PKGDIR environment variable before starting Julia, and run Pkg.add("DataFrames").