Query.jl: collect's return type is not as expected

Hello, here’s something I don’t understand about Query.jl: After running a few query commands and calling collect, I was expecting the result to be of type Array{MyObs{Float64}, 1}, but got Array{MyObs,1} instead. When I checked the type of each element of the array, it’s indeed of type MyObs{Float64}. So why is collect not returning Array{MyObs{Float64}, 1}?

Reproducible example:

using LinearAlgebra, Random, StatsModels, Query, DataFrames

struct MyObs{T <: LinearAlgebra.BlasReal}
    # fields inferred from the MyObs{T}(y, X, Z, xty, zty) call below
    y::Vector{T}
    X::Matrix{T}
    Z::Matrix{T}
    xty::Vector{T}
    zty::Vector{T}
end

function MyObs(
    y::Vector{T},
    X::Matrix{T},
    Z::Matrix{T}
    ) where T <: LinearAlgebra.BlasReal
    xty = transpose(X) * y
    zty = transpose(Z) * y
    MyObs{T}(y, X, Z, xty, zty)
end

function myobs(data_obs, feformula::FormulaTerm, reformula::FormulaTerm)
    y, X = StatsModels.modelcols(feformula, data_obs)
    Z = StatsModels.modelmatrix(reformula, data_obs)
    return MyObs(y, X, Z)
end

reps = 20; N = 10; p = 5; q = 2
X = Matrix{Float64}(undef, N*reps, p)
# fill with standard normal draws; the values do not matter for the type question
Random.randn!(X)
rand_intercept = zeros(N*reps)
for j in 1:N
    # one random intercept shared by all observations of subject j
    rand_intercept[(reps * (j-1) + 1) : reps * j] .= Random.randn(1)
end
y = X * ones(p) + rand_intercept + Random.randn(N*reps)
id = repeat(1:N, inner = reps)
# DataFrames 1.x requires DataFrame(matrix, :auto) in the two constructor calls below
dat = hcat(rename!(DataFrame(hcat(id)), [:id]), DataFrame(hcat(y, X)))
rename!(dat, Symbol.(["id", "y", "x1", "x2", "x3", "x4", "x5"]))

function test_id(subset_id::Vector{T}, x::T, k::Int) where T
    # subset_id is sorted: true if the first element >= x sits at index <= k
    res = searchsortedfirst(subset_id, x) <= k
    return res
end

k=5; subset_id = [1:1:5;]
feformula   = @formula(y ~ 1 + x1 + x2 + x3 + x4 + x5)
reformula   = @formula(y ~ 1)
feformula = apply_schema(feformula, schema(feformula, dat))
reformula = apply_schema(reformula, schema(reformula, dat))

Then running

obsvec = dat |> @groupby(_.id) |> @filter(test_id(subset_id, key(_), k)) |> @map(myobs(_, feformula, reformula)) |> collect

We get an Array{MyObs,1} rather than the expected Array{MyObs{Float64},1}.


This is an unfortunate outcome of a reliance on type inference in the EnumerableMap type. It predetermines the output eltype by calling:

T = Base._return_type(f, Tuple{TS,})

where f in this case is essentially your myobs function and Tuple{TS,} is the expected type of the previous operations in the chain (specifically, Tuple{Grouping{Int64,NamedTuple{(:id, :y, :x1, :x2, :x3, :x4, :x5),Tuple{Int64,Float64,Float64,Float64,Float64,Float64,Float64}}}}).

My guess is that the call to myobs(_, feformula, reformula) is just sufficiently complex that the compiler can’t guarantee the return type will be MyObs{Float64}, so the best it can do is MyObs. This might be affected by a number of things, like the inferability of the transpose code, StatsModels.modelcols or StatsModels.modelmatrix, FormulaTerm, or just the plain nesting complexity of everything here.

In any case, by calling Base._return_type, EnumerableMap commits the output eltype to whatever the compiler can figure out before execution. When the result is materialized (via collect), collect asks whether Base.IteratorEltype reports a known eltype (in this case it does) and uses that eltype to allocate the output array.
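To make the inference step a bit more concrete, here is a minimal sketch using Base.return_types, the reflection counterpart of the internal Base._return_type call. The Obs struct and the stable/unstable functions are made up for illustration and are not part of Query.jl; the exact abstract type that inference reports can differ between Julia versions, so the sketch only checks whether it is concrete.

```julia
# Hypothetical stand-in for MyObs: a struct with one type parameter.
struct Obs{T<:Real}
    x::T
end

# Concrete element type in: inference can pin down the type parameter.
stable(v::Vector{Float64}) = Obs(v[1])

# Element type Any in: the parameter T is only known at run time.
unstable(v::Vector{Any}) = Obs(v[1])

# Ask the compiler what it can prove before running anything.
inferred_stable = first(Base.return_types(stable, (Vector{Float64},)))
inferred_unstable = first(Base.return_types(unstable, (Vector{Any},)))

println(isconcretetype(inferred_stable))    # true: the parameter is known
println(isconcretetype(inferred_unstable))  # false: only an abstract bound is known

# At run time the value is still fully typed.
println(typeof(unstable(Any[1.0])) == Obs{Float64})  # true
```

An eltype fixed from the second kind of result is what produces an array typed with the abstract Obs even though every element is an Obs{Float64}.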

If instead, EnumerableMap defined:

Base.IteratorEltype(::Type{<:EnumerableMap}) = Base.EltypeUnknown()

then a different collect algorithm is used, where the output array type is “promoted” as elements are iterated. This introduces one step of type instability (at least the initial call to iterate plus the array allocation), but can lead to a more accurate output type. I tested this locally and it does return a 5-element Array{MyObs{Float64},1}.
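The difference between the two collect paths can be reproduced without Query.jl. The following is only a sketch: DeclaredEltype and UnknownEltype are hypothetical wrapper iterators standing in for EnumerableMap, and Obs is a made-up struct standing in for MyObs.

```julia
# Hypothetical stand-in for MyObs.
struct Obs{T<:Real}
    x::T
end

# Wrapper that declares an abstract eltype up front, the way an
# inference-based eltype would when the compiler only proves `Obs`.
struct DeclaredEltype{V}
    data::V
end
Base.iterate(w::DeclaredEltype, st...) = iterate(w.data, st...)
Base.length(w::DeclaredEltype) = length(w.data)
Base.eltype(::Type{<:DeclaredEltype}) = Obs

# Same wrapper, but it tells collect that the eltype is not known in advance.
struct UnknownEltype{V}
    data::V
end
Base.iterate(w::UnknownEltype, st...) = iterate(w.data, st...)
Base.length(w::UnknownEltype) = length(w.data)
Base.IteratorEltype(::Type{<:UnknownEltype}) = Base.EltypeUnknown()

items = Any[Obs(1.0), Obs(2.0), Obs(3.0)]

a = collect(DeclaredEltype(items))  # allocated from the declared eltype
b = collect(UnknownEltype(items))   # eltype taken from the elements themselves

println(eltype(a) == Obs)           # true
println(eltype(b) == Obs{Float64})  # true
```

With EltypeUnknown, collect looks at the first element to pick the initial array type and widens it only if a later element does not fit.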

There are trade-offs between both approaches and even the Base.collect algorithm tries to use a hybrid approach between inspecting Base._return_type and just “growing” the output container. It’s actually one of the more interesting “dynamic” problems that Julia has vs. other languages, IMO, and it’s really interesting to see different approaches and the resulting side effects.

Hope that helps?


Thank you very much! This is really comprehensive.
Perhaps it’s also possible to give users control over the output type as in Base.collect?

Yeah, I didn’t think of that, but that’s definitely a work-around here, like:

julia> collect(MyObs{Float64}, obsvec)
5-element Array{MyObs{Float64},1}:

The other idea I had was that you could write your own collect that ignores eltype entirely and only builds up the container type as it iterates over the elements. I think this is roughly the idea behind BangBang.jl (JuliaFolds/BangBang.jl, “Immutables as mutables, mutables as immutables”), but I haven’t dug into that code very deeply.
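A hand-rolled version of that idea could look roughly like the sketch below. collect_narrow is a hypothetical helper name, not something from Query.jl or BangBang.jl, and it trades some type stability for a tighter element type.

```julia
# Hypothetical helper: collect an iterator while ignoring its declared eltype.
# The container starts out typed after the first element and is widened with
# typejoin only when a later element does not fit.
function collect_narrow(itr)
    y = iterate(itr)
    y === nothing && return Any[]   # nothing to infer from an empty iterator
    v, st = y
    dest = [v]                      # Vector{typeof(v)}
    while true
        y = iterate(itr, st)
        y === nothing && return dest
        v, st = y
        if v isa eltype(dest)
            push!(dest, v)
        else
            # Widen: copy into a vector whose eltype covers old and new elements.
            T = typejoin(eltype(dest), typeof(v))
            dest = push!(convert(Vector{T}, dest), v)
        end
    end
end

narrow = collect_narrow(Any[1.0, 2.0])  # all elements are Float64
mixed = collect_narrow(Any[1, 2.5])     # Int64 then Float64

println(eltype(narrow) == Float64)  # true
println(eltype(mixed) == Real)      # true: typejoin(Int64, Float64) is Real
```

In the thread's example this would be called on the un-collected query pipeline in place of the final collect.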
