Announcement: An Update on DataFrames Future Plans

announcement

#1

This post summarizes the current roadmap for DataFrames.jl and the data ecosystem in general for the Julia 1.0 milestone. Plans have changed significantly since the previous post. It has turned out that representing missing values via the special Nullable type is not the only way to attain high performance in Julia. Thanks to work done by @jameson and @quinnj, representations of nullable values similar to that used by DataArrays (i.e. Union{T, NAtype}) can also be made efficient by improving the compiler. We hope that these improvements will be available in Julia 0.7 and 1.0, but this approach already works in existing Julia versions, with the “only” limitation that it is slow. Compared with Nullable, it has the advantage that applying operations to nullable values or arrays does not require unwrapping the values, which has turned out to be cumbersome in many practical applications.

Compiler improvements will also ensure that Array{Union{T, NAtype}} objects are stored efficiently in memory. Instead of using the current inefficient memory layout, they will be stored as an Array{T} associated with an Array{UInt8} of the same size, the former holding the values and the latter indicating, for each element, the type of the value stored in the former. This memory layout closely matches that of DataArray and NullableArray, which removes the need to use these custom types instead of the standard Array. This means that nullable arrays automatically work when combined with array comprehensions, map or broadcast: whether the result is a nullable or a non-nullable array will simply depend on whether the operation returns null values or not.
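The layout described above can be mimicked in plain Julia. The following is only a rough sketch: `TaggedVector` is a hypothetical type invented for illustration (the real change happens inside the compiler and the standard Array), and `Null` / `null` are local stand-ins for the definitions that Nulls.jl provides.

```julia
# Stand-ins for the names used in the post; the real definitions live in Nulls.jl.
struct Null end
const null = Null()

# Hypothetical type mimicking the planned layout of Array{Union{T, Null}}:
# a payload array plus one tag byte per element.
struct TaggedVector{T} <: AbstractVector{Union{T, Null}}
    values::Vector{T}    # payload; the slot of a null element holds an arbitrary T
    tags::Vector{UInt8}  # 0x00 = element is null, 0x01 = element is a T
end

Base.size(v::TaggedVector) = size(v.values)
Base.IndexStyle(::Type{<:TaggedVector}) = IndexLinear()

# Indexing reads the tag first and only then decides what to return.
Base.getindex(v::TaggedVector, i::Int) = v.tags[i] == 0x01 ? v.values[i] : null

v = TaggedVector([10, 0, 30], UInt8[0x01, 0x00, 0x01])
```

Here `v[2]` yields `null` even though the payload slot contains a `0`, because the tag byte, not the payload, decides which type the element has.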

Finally, in order to improve the usability of the Union representation of missingness and the consistency with other languages, the NA value (of type NAtype) will be replaced with null (of type Null). A nullable (i.e. possibly missing) value will be of type Union{T, Null}, written as ?T in the short form (in Julia 1.0, this will be changed to T? for consistency with languages such as C# or Swift, but this syntax is not available in current Julia versions). Functions which are known to work both with arguments of type T and with missing values will have to declare ?T / T? or Array{<:?T} / Array{<:T?} in their signatures. This will require that standard functions and operators be defined manually where it makes sense (“whitelist approach”): for example, +(x::Any, y::Null) = null and +(x::Null, y::Any) = null will ensure that addition is supported for all types and propagates missing values. These features are currently being developed in the Nulls.jl package, with the goal of integrating them in Julia itself so that Base functions like sum will accept nullable arrays.
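As a minimal sketch of the whitelist approach, the two methods quoted above can be written against a local stand-in for `Null` (Nulls.jl provides the real type and value). The third method and the `increment` helper are assumptions added for this example: without a `(Null, Null)` method, `null + null` would match both whitelisted methods equally well and raise an ambiguity error.

```julia
# Stand-ins for the names used in the post; the real definitions live in Nulls.jl.
struct Null end
const null = Null()

# The two whitelisted methods quoted in the post.
Base.:+(x::Any, y::Null) = null
Base.:+(x::Null, y::Any) = null
# Added for this sketch: resolves the ambiguity between the two methods above.
Base.:+(x::Null, y::Null) = null

# Hypothetical function that declares it accepts possibly missing integers,
# spelled with the long form Union{Int, Null} instead of ?Int / Int?.
increment(x::Union{Int, Null}) = x + 1

increment(2)     # ordinary integer addition
increment(null)  # propagates null without any unwrapping
```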

A pull request is currently under review to port DataTables to the new framework. However, since the Union{T, Null} representation of missingness is actually closer to the Union{T, NAtype} representation used by DataArrays and DataFrames than to the Nullable{T} representation used by NullableArrays and DataTables, and since the DataFrames package is more established than DataTables, further development will happen in DataFrames (after backporting features from DataTables). The transition to the new framework should be possible without breaking the existing code (or at least not too much). Other packages in the ecosystem have also started to be ported: this includes CategoricalArrays and DataStreams.

We hope this roadmap will put an end to the uncertainty around the future of the data ecosystem in Julia, which led to fragmentation over the last year. The Union{T, Null} approach appears to be the perfect choice since it takes advantage of Julia’s strengths, including standard structures like Array, type inference and multiple dispatch, while being consistent with other languages and mostly compatible with the previous design of DataFrames.

UPDATE: see the DataFrames 0.11 announcement.


#2

This sounds excellent! Thanks for all the great work!


#3

Great to hear this. Just to be clear: in the meantime (until the ?T / T? syntax is resolved), it should be OK to write Union{T, Null} everywhere? With the understanding of course that it may not be as fast as it will eventually be.


#4

Yes, Union{T, Null} should be fine and will be equivalent to T? once that syntax is available.


#5

Great news. This answers the first question I am asked when introducing R-users to Julia.

I have one question (as it is not clear to me from the post):
what are the plans for Nullable - will it be removed, redefined so that Nullable{T} will be equivalent to ?T / T? or left as currently defined in 0.6?


#6

I think what we’d like to do is have two distinct representations of “nullability”:

  1. The data analyst’s null. This is like NaN but for any type. This is what null from the Nulls.jl package provides.

  2. The software engineer’s null. This is effectively (or at least conceptually) equivalent to Nullable{T}, a container with 0 or 1 element that is useful for things like Keno’s proposed iteration protocol. No arithmetic should be defined on this.

To answer your question, we’d like to keep the latter but rename it to something like Option{T} (akin to Rust) or Maybe{T}. So there will still be a container-based null, but it should not be used for representing missing data in the statistical sense, only in the programming sense (where everything is data).
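The distinction between the two kinds of null can be sketched as follows. Everything here is hypothetical and only meant to illustrate the idea: `Maybe`, `hasvalue`, `unwrap` and `first_negative` are names made up for this example, and `Null` / `null` are local stand-ins for what Nulls.jl provides.

```julia
# Data analyst's null: a value that propagates through arithmetic.
struct Null end
const null = Null()
Base.:+(::Null, ::Number) = null
Base.:+(::Number, ::Null) = null

# Software engineer's null: a container holding zero or one element.
# Deliberately, no arithmetic is defined on it.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)     # empty; `value` is left uninitialized
    Maybe{T}(x) where {T} = new{T}(true, x)  # holds exactly one element
end

hasvalue(m::Maybe) = m.hasvalue
unwrap(m::Maybe) = hasvalue(m) ? m.value : throw(ArgumentError("Maybe is empty"))

# Hypothetical search function: "not found" is a programming-level absence,
# so the caller has to check and unwrap explicitly.
function first_negative(xs::Vector{Int})
    for x in xs
        x < 0 && return Maybe{Int}(x)
    end
    return Maybe{Int}()
end

found = first_negative([3, -2, 5])
missing_result = first_negative([3, 2, 5])
```

With these definitions `null + 1` silently gives `null`, while `Maybe{Int}(1) + 1` fails with a `MethodError`, which is the behaviour each audience is said to want.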

Hopefully that wasn’t more confusing. :neutral_face:


#7

What is the advantage of this over just using null for everything? For example, what is the downside of having something like tryparse(T, s) return T? instead of Nullable{T}? Entia non multiplicanda sunt (entities should not be multiplied).


#8

To answer this I shall quote our beloved friend and Nullable visionary, John Myles White:

There are two almost totally irreconcilable reasons you might care about Julia’s Nullable type:

  • You are a software engineer and you deal with things that are “null” – like null pointers and null handles to databases. You absolutely do not want such a value to propagate automatically. Many systems have been rebuilt almost from scratch (Javascript in Flow and PHP in Hack) in part because automatic propagation of nulls is a nightmare for large-scale software design.
  • You are a data analyst and you want to deal with missing values. You want to be able to execute arbitrary expressions against databases that may contain missing values and you want these missing values to be propagated automatically.

Julia’s Nullable type was not meant to be optimized for either use case-- it was meant to be a building block for other packages to expand upon. This has come into direct conflict with the community’s increasing desire to prevent type piracy. There isn’t a simple solution, but it’s worth keeping in mind that many of the proposals in this thread aren’t Pareto improvements for the Julia community: allowing expressions like 1 + Nullable(1) is likely to do harm to software engineers in order to benefit data analysts.

(taken from this discussion on GitHub)


#9

Let’s have the discussion about the future of Nullable on GitHub. This is indeed a major question to address before Julia 0.7.


#10

Great news!


#11

Thanks for the update.

This may seem like a minor issue, but, at least for the time being, the use of the ? character is a big headache for me and, I would imagine, many others as it totally screws up julia-vim’s parsing (is it also a problem for Juno?). I’m begging you to use literally any other valid character unless there is a simple workaround.

Also, I have one major technical concern: even assuming the memory layout of arrays is absolutely pristine and causes no performance issues, won’t this new approach cause excessive amounts of run-time dispatch? For example, if I have an A::Array{Union{Float64, Null}}, when I call A[1] the compiler cannot know whether the result will be a Float64 or a Null, and dispatching some function with methods f(x::Float64), f(x::Null) can take more than 100ns even in this simplest case. This seems like a big problem. Has thought been given to how to reduce the dispatch time in these cases? Perhaps my benchmarking has been deceptive for some reason?


#12

As I said, that convention is used by several major languages already, so we should rather fix the parsers. How do they handle e.g. Swift?

To be clear, the performance optimizations are not yet available, so benchmarks are expected to be slow. I’m not the best person to talk about that, but I think dispatch can be made efficient for small Unions since the code does not need to look for a method in a large methods table at runtime: the compiler knows at compile time that x is either Null or Float64, so it just needs to introduce a branch to call either one method or the other. In addition, in many cases you’ll have f(::Null) = null, so the call will be completely inlined for the null branch. I think @jameson even said that it should be possible to have SIMD-enabled loops calling operations on Union{T, Null} arguments (i.e. eliminating any branch).


#13

If anyone knows of any simple or hackish workarounds for julia-vim, I’d be extremely grateful.


#14

Yes. Just to be explicit, f(a[1]) will optimize to something like

if isa(a[1], Null)  # branch on the element's type
  f(a[1]::Null)  # Static dispatch
else
  f(a[1]::Float64)  # Static dispatch
end
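Written out by hand, the branch described above looks roughly like the following. This is a sketch of the idea rather than what the compiler actually emits; `f` and `apply_f` are hypothetical functions, and `Null` / `null` are local stand-ins for what Nulls.jl provides.

```julia
# Stand-ins for the names used in the thread; the real definitions live in Nulls.jl.
struct Null end
const null = Null()

# Hypothetical function with one method per member of the Union.
f(x::Float64) = 2x
f(::Null) = null

function apply_f(a::Vector{Union{Float64, Null}})
    out = Vector{Union{Float64, Null}}(undef, length(a))
    for i in eachindex(a)
        x = a[i]
        if x isa Null
            out[i] = f(x)  # x can only be a Null here, so the method is known
        else
            out[i] = f(x)  # x can only be a Float64 here, so the method is known
        end
    end
    return out
end

a = Union{Float64, Null}[1.0, null, 3.0]
b = apply_f(a)
```

Because the element type has only two members, a single type check is enough to pick the method in each branch, and no lookup in the method table is needed at run time.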

#15

With the future Array{Union{T, NAType/Null/Whatever}}, could a BitArray be used instead of the Array{UInt8} to reduce memory used for the NA flags?


#16

I thought this optimization already existed for unions with a small number of elements.


#17

I’m not sure really, I know people were talking about it at some point. If we already have this, then we are much closer to replacing Nullable with the union representation than I thought.


#18

That’s one of the options, it’s been discussed in this issue.


#19

10 posts were split to a new topic: Representing Nullable Values


#20

Potentially, but there are tradeoffs with performance. For now, it will be one extra byte per array element. Also note that the compiler improvements currently underway generalize to more than just Vector{Union{T, Null}}: they support Arrays of arbitrary dimensions, Unions with more than 2 unioned types, and the modifying operations of Vector (i.e. push!, append!, deleteat!, etc.).

We could certainly explore the limited case where you only have a Union of 2 types and use a single bit per array element, but it would slow down indexing with the additional byte->bit manipulations needed.
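To make the tradeoff concrete, here is a small sketch comparing one tag byte per element with one tag bit per element. The helpers `packed_flags`, `getflag` and `setflag!` are hypothetical and written only for this illustration; they are not part of Julia or of the planned implementation.

```julia
# Hypothetical helpers: store one flag per bit inside a Vector{UInt8}.
packed_flags(n::Int) = zeros(UInt8, cld(n, 8))

function getflag(bytes::Vector{UInt8}, i::Int)
    byte = bytes[((i - 1) >> 3) + 1]                  # byte that holds flag i
    return ((byte >> ((i - 1) & 7)) & 0x01) == 0x01   # shift and mask out one bit
end

function setflag!(bytes::Vector{UInt8}, i::Int, flag::Bool)
    idx = ((i - 1) >> 3) + 1
    mask = 0x01 << ((i - 1) & 7)
    bytes[idx] = flag ? (bytes[idx] | mask) : (bytes[idx] & ~mask)
    return bytes
end

n = 1000
byte_tags = zeros(UInt8, n)   # one byte per element: a plain load per lookup
bit_tags = packed_flags(n)    # one bit per element: a load plus shift and mask

setflag!(bit_tags, 10, true)
byte_tags[10] = 0x01

sizeof(byte_tags)  # n bytes
sizeof(bit_tags)   # cld(n, 8) bytes
```

The packed version uses an eighth of the memory for the flags, but every read and write pays for the extra index arithmetic, shift and mask, which is the slowdown mentioned above.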