By adding support for missing data and NamedTuples, Julia offers very solid foundations on which to build a data ecosystem. While I’m extremely happy that these two features have been added (especially given the extensive discussion and work they required), I’d like to show some scenarios where I’m afraid they do not play nicely with each other, and to see what the possible solutions are (I’m not an expert on the technical side, though, so my ideas in this respect may be flawed or unfeasible). This issue has already been mentioned elsewhere, but I hope a writeup that is understandable to non-experts (such as myself) can be useful.
Let’s start with a concrete example. When updating the package MySQL to the new version of DataFrames, it became clear that there was something tricky with the corresponding row iterator. To be more specific, let’s imagine that I’m streaming data from a remote dataset with two columns, :x and :y, which contain integers and can both have missing data. A first obvious attempt is to iterate each row as a tuple, with a value for :x and one for :y. However, this is type unstable, as every row could then have 4 different types:
Tuple{Int, Int}
Tuple{Int, Missing}
Tuple{Missing, Int}
Tuple{Missing, Missing}
This seems bad as the number of possible types increases exponentially with the number of columns of the dataset. Note that explicitly typing the returned Tuple doesn’t actually do anything, as Tuples are covariant, therefore:
julia> typeof(Tuple{Union{Missing, Int64}, Union{Missing, Int64}}((1, 2)))
Tuple{Int64,Int64}
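To make the covariance point concrete, here is a minimal sketch. The helper `rowtypes` is a made-up name used only for illustration; it enumerates the concrete row types for n columns of integers that may be missing.

```julia
# Tuple types are covariant: each concrete row type is a subtype of the
# Union-typed tuple, and a tuple value always reports its concrete type.
T = Tuple{Union{Missing, Int}, Union{Missing, Int}}

Tuple{Int, Int} <: T                  # subtype check against the Union-typed tuple
typeof(convert(T, (1, missing)))      # conversion leaves the concrete type unchanged

# Hypothetical helper: all concrete row types for n Int-or-missing columns.
rowtypes(n) = vec([Tuple{ts...} for ts in
                   Iterators.product(ntuple(_ -> (Int, Missing), n)...)])

length(rowtypes(2))   # the four types listed above
length(rowtypes(10))  # grows as 2^n
```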
The next attempt is to use NamedTuples instead, as IIUC they are not covariant, so the explicit typing is effective:
julia> typeof(NamedTuple{(:x, :y),Tuple{Union{Missing,Int}, Union{Missing, Int}}}((1, 2)))
NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}
This effectively solves the problem of what a data source should iterate: an explicitly typed NamedTuple where some fields explicitly accept missing data.
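As a rough sketch of what such a data source could look like, the following uses two in-memory vectors as a stand-in for the remote dataset. The names `Row`, `xs`, `ys` and `rows` are assumptions made for the example and are not taken from MySQL or DataFrames.

```julia
# Hypothetical row type for the two-column example.
Row = NamedTuple{(:x, :y), Tuple{Union{Missing, Int}, Union{Missing, Int}}}

# Stand-in for the remote source: two columns that may contain missing values.
xs = [1, 1, missing, missing]
ys = [1, missing, 1, missing]

# Every row is converted to the same explicitly typed NamedTuple.
rows = (Row((x, y)) for (x, y) in zip(xs, ys))

collected = collect(rows)
eltype(collected) == Row                     # a single element type for all rows
all(r -> typeof(r) == Row, collected)        # each value keeps the explicit typing
```

Unlike the plain tuple case, the type of each row here does not depend on which values happen to be missing.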
However, as was mentioned here by @davidanthoff, applying a function elementwise to a vector of NamedTuples would quickly lead to type instabilities:
julia> v = NamedTuple{(:x, :y),Tuple{Union{Missing,Int}, Union{Missing, Int}}}.([(1,1),(1,missing),(missing,1),(missing,missing)])
4-element Array{NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}},1}:
NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((1, 1))
NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((1, missing))
NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((missing, 1))
NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((missing, missing))
julia> map(i -> (s = i.x, t = i.x+i.y), v)
4-element Array{NamedTuple{(:s, :t),T} where T<:Tuple,1}:
(s = 1, t = 2)
(s = 1, t = missing)
(s = missing, t = missing)
(s = missing, t = missing)
Note that here the resulting Vector has lost the typing of its NamedTuples.
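The loss of typing can also be checked programmatically. This is just a sketch that rebuilds the example above and inspects the element types; `Row` is an illustrative name.

```julia
Row = NamedTuple{(:x, :y), Tuple{Union{Missing, Int}, Union{Missing, Int}}}
v = Row.([(1, 1), (1, missing), (missing, 1), (missing, missing)])
w = map(i -> (s = i.x, t = i.x + i.y), v)

eltype(v) == Row             # the input vector has one explicit row type
isconcretetype(eltype(w))    # the output vector's element type is no longer concrete
unique(typeof.(w))           # the distinct concrete row types that ended up in w
```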
This I believe would be a problem both for Query.jl and JuliaDB.jl, were they to transition to Missing, as they both have this map operator (in JuliaDB it actually is map, in Query it is @map or @select and corresponds to a LINQ select statement). The only way to rescue this, right now, would be to explicitly type the outcome of the anonymous function that map uses, which would greatly damage usability.
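For reference, this is roughly what explicitly typing the outcome would look like in plain Julia. The names `RowIn` and `RowOut` are made up for the example, and the output field types are spelled out by hand, which is exactly the usability cost in question.

```julia
RowIn  = NamedTuple{(:x, :y), Tuple{Union{Missing, Int}, Union{Missing, Int}}}
RowOut = NamedTuple{(:s, :t), Tuple{Union{Missing, Int}, Union{Missing, Int}}}

v = RowIn.([(1, 1), (1, missing), (missing, 1), (missing, missing)])

# The result type has to be written out inside the anonymous function.
w = map(i -> RowOut((i.x, i.x + i.y)), v)

eltype(w) == RowOut   # the typing of the NamedTuples is preserved
```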
Given that a Vector of NamedTuples has been proposed as a possible “general” table representation and that being unable to apply map to it in a type stable way is a clear drawback, I wondered whether we need a new type, something like a NullableNamedTuple, which by default allows missing data. For example NullableNamedTuple{(:x, :y),Tuple{Int64, Int64}} would be equivalent to NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}. Then, one could use map(f, v) where v is a vector of NullableNamedTuples and f takes as input a NullableNamedTuple and outputs a NullableNamedTuple. Note that this would be reasonably easy to incorporate in Query as it already uses a special syntax {...} for NamedTuples and it would maybe be possible to use that syntax to instead mean NullableNamedTuple.
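The intended equivalence can be sketched at the type level with a small helper. `nullable_type` is a hypothetical function, not an existing API, and the sketch assumes Julia 1.1 or later for `fieldtypes`. It only maps a NamedTuple type to its missing-accepting counterpart; it does not define a new type.

```julia
# Hypothetical helper: widen every field type S to Union{Missing, S}.
nullable_type(::Type{NamedTuple{names, T}}) where {names, T <: Tuple} =
    NamedTuple{names, Tuple{map(S -> Union{Missing, S}, fieldtypes(T))...}}

N = nullable_type(NamedTuple{(:x, :y), Tuple{Int, Int}})
N == NamedTuple{(:x, :y), Tuple{Union{Missing, Int}, Union{Missing, Int}}}

# Using the widened type as the output of the mapped function.
v = N.([(1, 1), (1, missing), (missing, 1), (missing, missing)])
Out = nullable_type(NamedTuple{(:s, :t), Tuple{Int, Int}})
w = map(i -> Out((i.x, i.x + i.y)), v)
```

Note that the non-missing field types still have to be known up front in this sketch, since they cannot be recovered from a value that happens to be missing.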
Another alternative would be to have a simplified syntax to declare which fields would accept missing data. For example (the notation is made up) (x ?= 1, y = 2) would be of type NamedTuple{(:x, :y),Tuple{Union{Missing, Int64}, Int64}}. In this case, to apply map the user would have a slightly less convenient syntax and would have to type:
map(i -> (s ?= i.x, t ?= i.x+i.y), v)
The trade-off here would be whether it’s better to be explicit about which fields can have missing data, at the cost of a more verbose syntax.
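Since ?= is not valid Julia syntax, the per-field variant can only be approximated with an ordinary function. The helper `allow_missing` below is hypothetical and assumes Julia 1.1 or later for `fieldtypes`. It derives the field types from the runtime value, so it reproduces the (x ?= 1, y = 2) example but would not recover Int from a field whose value is already missing.

```julia
# Hypothetical stand-in for the made-up `?=` syntax: widen only the fields
# listed in `names` so that they accept missing.
function allow_missing(nt::NamedTuple{K}, names::Symbol...) where {K}
    types = map(K, fieldtypes(typeof(nt))) do k, T
        k in names ? Union{Missing, T} : T
    end
    return NamedTuple{K, Tuple{types...}}(Tuple(nt))
end

nt = allow_missing((x = 1, y = 2), :x)
typeof(nt) == NamedTuple{(:x, :y), Tuple{Union{Missing, Int}, Int}}
```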