This post summarizes the current roadmap for DataFrames.jl and the data ecosystem in general for the Julia 1.0 milestone. Plans have changed significantly since the previous post. It has become clear that representing missing values via the special Nullable type is not the only way to achieve high performance in Julia. Thanks to work done by @jameson and @quinnj, it turns out that representations of nullable values similar to the one used by DataArrays (i.e. Union{T, NAtype}) can also be made efficient by improving the compiler. We hope that these improvements will be available in Julia 0.7 and 1.0, but this approach already works in existing Julia versions, with the “only” limitation that it is slow. Compared with Nullable, it has the advantage that applying operations to nullable values or arrays does not require unwrapping the values, which has turned out to be cumbersome in many practical applications.
Compiler improvements will also ensure that Array{Union{T, NAtype}} objects are stored efficiently in memory. Instead of using the current inefficient memory layout, they will be stored as an Array{T} associated with an Array{UInt8} of the same size, the former holding the values and the latter indicating the type of the value stored at each position. This memory layout closely matches that of DataArray and NullableArray, which removes the need to use these custom types instead of the standard Array. This means that nullable arrays automatically work when combined with array comprehensions, map or broadcast: whether the result is a nullable or a non-nullable array will simply depend on whether the operation returns null values or not.
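The behaviour described above can be illustrated with a minimal sketch. It assumes Julia 1.0 or later, where this design is available in Base under the names missing and Missing rather than the null and Null used in this post; the variable names are only for illustration.

```julia
# Assumption: Julia 1.0+, where the post's `null` is spelled `missing`.
v = Union{Int, Missing}[1, missing, 3]

# The operation never returns a missing value, so the result is a plain array.
filled = map(x -> ismissing(x) ? 0 : x, v)
@assert filled isa Vector{Int}
@assert filled == [1, 0, 3]

# Broadcasting propagates the missing value, so the result stays nullable.
shifted = v .+ 1
@assert shifted isa Vector{Union{Int, Missing}}
@assert isequal(shifted, [2, missing, 4])

# A comprehension behaves like `map`: the element type follows the values produced.
flags = [ismissing(x) for x in v]
@assert flags == [false, true, false]
```

No custom array type appears anywhere: the standard Array carries the missingness through its element type alone.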
Finally, in order to improve the usability of the Union representation of missingness and the consistency with other languages, the NA value (of type NAtype) will be replaced with null (of type Null). A nullable (i.e. possibly missing) value will be of type Union{T, Null}, written as ?T in the short form (in Julia 1.0, this will be changed to T? for consistency with languages such as C# or Swift, but this syntax is not available in current Julia versions). Functions which are known to work both with arguments of type T and with missing values will have to declare ?T / T? or Array{<:?T} / Array{<:T?} in their signatures. This will require that standard functions and operators be defined manually where it makes sense (“whitelist approach”): for example, +(x::Any, y::Null) = null and +(x::Null, y::Any) = null will ensure that addition is supported for all types and propagates missing values. These features are currently being developed in the Nulls.jl package, with the goal of integrating them into Julia itself so that Base functions like sum will accept nullable arrays.
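As a rough sketch of the whitelist approach, the two definitions quoted above can be tried on a locally defined stand-in type. The Null type and null value below are assumptions made for the example, not the Nulls.jl implementation, and halve is a made-up function name.

```julia
# Hypothetical stand-in for the `Null` type described in the text,
# defined locally so the sketch does not depend on Nulls.jl.
struct Null end
const null = Null()

# Whitelist addition: any operand combined with `null` yields `null`.
Base.:+(x::Any, y::Null) = null
Base.:+(x::Null, y::Any) = null
# Extra method that resolves the ambiguity of the two above for `null + null`.
Base.:+(x::Null, y::Null) = null

# A function that declares in its signature that it accepts possibly missing
# integers (`halve` is an illustrative name).
halve(x::Union{Int, Null}) = x === null ? null : x / 2

@assert 1 + null === null
@assert null + 2.5 === null
@assert null + null === null
@assert halve(4) == 2.0
@assert halve(null) === null

# Subtraction was not whitelisted, so it raises an error instead of
# silently propagating the missing value.
subtraction_failed = try
    1 - null
    false
catch err
    err isa MethodError
end
@assert subtraction_failed
```

The third addition method is a consequence of multiple dispatch: with only the two definitions from the text, a call with two null arguments would match both equally well and be reported as ambiguous.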
A pull request is currently under review to port DataTables to the new framework. However, since the Union{T, Null} representation of missingness is actually closer to the Union{T, NAtype} representation used by DataArrays and DataFrames than to the Nullable{T} representation used by NullableArrays and DataTables, and since the DataFrames package is more established than DataTables, further development will happen in DataFrames (after backporting features from DataTables). The transition to the new framework should be possible without breaking existing code (or at least not too much of it). Other packages in the ecosystem have also started to be ported, including CategoricalArrays and DataStreams.
We hope this roadmap will put an end to the uncertainty around the future of the data ecosystem in Julia, which has led to fragmentation over the last year. The Union{T, Null} approach appears to be the ideal choice since it takes advantage of Julia’s strengths, including standard structures like Array, type inference and multiple dispatch, while being consistent with other languages and mostly compatible with the previous design of DataFrames.
UPDATE: see the DataFrames 0.11 announcement.