Alright, I finally managed to catch up with this thread and some of the issues that it links to. Sorry for being slow.
My high-level comment is that I think we need to get clear feedback from the core devs on whether a strategy that relies on inference is all of a sudden blessed. The message I got from pretty much all the core devs was really clear: don’t use inference to pick container types. If they have changed their mind about that, it seems fine to investigate that route, but if not, it seems like a dead end to me. And at least I have not gotten any indication that there is a change in guidance on that point.
So my current plan for Query.jl is to actually get rid of the reliance on inference. I’ve put a fair bit of time and effort into that already. If there was a change of opinion about inference on the part of the core devs, I’d love to know, because it would mean I don’t have to spend time on that effort. Having said that, the current version of inference clearly warrants a move away from it in the Query.jl world: there are lots and lots of situations where things fail because inference can’t resolve things. I’ve pretty much mapped out how I can do that at this point. Essentially I would use the current `map` strategy throughout Query.jl. That will work great and be fast with `DataValue`, but I don’t see how it would work with `Union{T,Missing}`, because of the issues I described above.
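To make that last point concrete, here is a minimal illustration of the contrast (assuming the DataValues.jl package; the field name and values are made up): with `DataValue`, the concrete type of a named tuple does not depend on whether a value is present, so a value-based `map` strategy sees a single element type, whereas with `Union{T,Missing}` the concrete type changes with the value:

```julia
using DataValues  # DataValues.jl

# With DataValue, "present" and "missing" have the same concrete type:
typeof((a = DataValue(1),))        # NamedTuple{(:a,),Tuple{DataValue{Int64}}}
typeof((a = DataValue{Int64}(),))  # NamedTuple{(:a,),Tuple{DataValue{Int64}}}

# With Union{T,Missing}, the concrete type depends on the value:
typeof((a = 1,))                   # NamedTuple{(:a,),Tuple{Int64}}
typeof((a = missing,))             # NamedTuple{(:a,),Tuple{Missing}}
```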
So I’m not sure we are any closer to a solution for this situation than we were when we started talking about this problem…
Alright, now some random reactions to various things from above:
@nalimilan I looked at https://github.com/JuliaLang/julia/pull/25553, and just want to make sure I understand it correctly. That is essentially an implementation of what I described as case “1. Plain array sink”, right? So it gets you an array of the right type, but it ends up copying things `n` times in the process, so it really is quite inefficient?
Also, does this really work on `master`? I just tried the following:
```julia
julia> [(3,missing),(missing,6.)]
2-element Array{Tuple{Any,Any},1}:
 (3, missing)
 (missing, 6.0)

julia> [(a=3,b=missing),(a=missing,b=5.)]
2-element Array{NamedTuple{(:a, :b),T} where T<:Tuple,1}:
 (a = 3, b = missing)
 (a = missing, b = 5.0)
```
so in neither case does it produce the type of array we would like to see, right? Or was that PR meant for some other scenario?
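For reference, what I would have expected to see there (my reading of the goal of that PR, not output it currently produces) are element types where the union is pushed down into the fields, i.e. something like:

```julia
Vector{Tuple{Union{Int,Missing}, Union{Float64,Missing}}}
Vector{NamedTuple{(:a, :b), Tuple{Union{Int,Missing}, Union{Float64,Missing}}}}
```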
In general, the design with `promote_typejoin` seems nice in that it allows new types to plug into that mechanism. But on the other hand, it is also a pretty clear case of what I meant when I wrote in some other thread a while ago that the `Union{T,Missing}` approach really leads to a design that is not very composable: now types like `NamedTuple` have to explicitly opt in and provide code to handle these cases with missing values. In my mind that is the poster child of an uncomposable design: named tuples have nothing to do with missing data, and missing data has nothing to do with named tuples. In a composable system these two would just work together without extra integration code. This is a real issue, at least for Query.jl: one of the core design goals of Query.jl is to not just work for tabular data, but for all sorts of data. So named tuples are just one case of a much larger class of structures that Query.jl needs to work with. But now, any other structure that one might want to use in place of a named tuple needs to provide the same integration with the missing data story via `promote_typejoin` (see the sketch below). I guess it can work, but it does not strike me as a really good design.
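Just to illustrate what I mean by “extra integration code”, here is a rough, untested sketch of what such an opt-in might look like, as I understand the mechanism in that PR (`MyPair` is a made-up stand-in for any non-tabular structure one might stream through Query.jl):

```julia
# A hypothetical container type that knows nothing about missing data.
struct MyPair{A,B}
    first::A
    second::B
end

# Opt in: tell the element-type widening machinery how to join two MyPair types,
# so that e.g. MyPair{Int,Missing} and MyPair{Missing,Float64} widen to
# MyPair{Union{Int,Missing},Union{Float64,Missing}} rather than an abstract type.
Base.promote_typejoin(::Type{MyPair{A1,B1}}, ::Type{MyPair{A2,B2}}) where {A1,B1,A2,B2} =
    MyPair{Base.promote_typejoin(A1, A2), Base.promote_typejoin(B1, B2)}

# And a conversion so already-collected elements can be copied into the widened array.
Base.convert(::Type{MyPair{A,B}}, p::MyPair) where {A,B} = MyPair{A,B}(p.first, p.second)
```

Every such type would need to ship this kind of glue code for missing data, which is exactly the coupling I’m objecting to.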
@piever made a point above that actually made me realize that there is another issue related to what we are discussing here that we haven’t even talked about, but that also seems tricky. In an iterator pipeline with named tuples containing missing data we really can have two different situations with the `Union{T,Missing}` story: early in the pipeline, when we iterate directly from, say, a `DataFrame`, the stream of elements would all have the same element type, i.e. the named tuple fields would always be `Union{T,Missing}` for columns that can have missing values. But after a `map` (or a similar operation, there are lots of those in Query.jl), we’ll have streams of 2^n different named tuple types. So now, if we chain another iterator after such a `map`, and that iterator is a higher-order function like, say, `filter`, then we have another, new problem: the anonymous function in the filter would now probably compile specialized methods for 2^n different type signatures, because it would of course be called with the 2^n different named tuple types. That can’t be good. So this is actually quite a distinct problem from the ones we have discussed so far. A small example of the effect is below.
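Here is a self-contained illustration (the field names and values are of course made up):

```julia
# A uniform stream, like what iterating a source with two
# Union{T,Missing} columns would produce:
rows = NamedTuple{(:a, :b),Tuple{Union{Int,Missing},Union{Float64,Missing}}}[
    (a = 1, b = 2.0),
    (a = missing, b = 2.0),
    (a = 1, b = missing),
    (a = missing, b = missing),
]
eltype(rows)  # a single element type for the whole stream

# After a map that rebuilds the named tuple from its field values,
# every combination of present/missing fields gets its own concrete type:
mapped = map(r -> (a = r.a, b = r.b), rows)
unique(typeof.(mapped))  # 4 = 2^2 distinct concrete NamedTuple types

# Any higher-order function chained after this sees all of those types,
# so its anonymous function can get specialized up to 2^n times:
filter(r -> !ismissing(r.a), mapped)
```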
I also just saw @StefanKarpinski’s message re Julia 1.0 and that the new optimizer is meant to help with the woes in the data space. Does someone have more details about that? What I’ve seen so far seemed to address the issues with the iteration protocol, which seem quite distinct from the issues here, so it would be great to hear more about to what extent the new optimizer would actually help with the problems in the data stack. Am I wrong to assume that in principle a new optimizer could help with the performance problem of anonymous functions that return 2^n types, but not really address the other (probably more difficult) issues discussed here?