Instability in StatsModels.jl term creation

I am trying to follow closely the documentation of StatsModels.jl to get a function term working as syntaxic sugar to define survival response in regression models. The issue is that i get a lot of type instabilities… I managed to get down to the following MWE :

using StatsModels, StatsAPI, DataFrames

struct SurvTerm{X, Y} <: AbstractTerm
Surv(T::Float64, Δ::Bool) = (T, Δ)
function StatsModels.apply_schema(t::FunctionTerm{typeof(Surv)},
    return apply_schema(SurvTerm(t.args...), sch, Mod) 
function StatsModels.apply_schema(t::SurvTerm{X,Y},
    Mod::Type{<:Any}) where {X,Y}
    T = apply_schema(t.T, sch, Mod)
    Δ = apply_schema(t.Δ, sch, Mod)
    isa(T, ContinuousTerm) || throw(ArgumentError("Surv only works with continuous terms (got $T)"))
    isa(Δ, ContinuousTerm) ||  throw(ArgumentError("Surv only works with discrete terms (got $Δ)"))
    return SurvTerm(T, Δ)
function StatsModels.modelcols(t::SurvTerm, d::NamedTuple)
    T = modelcols(t.T, d)
    Δ = modelcols(t.Δ, d)
    return hcat(T,Δ)

# then test it : 
df = DataFrame(
    time = -log.(rand(1000)), 
    status = rand(1000) .<= 0.8, 
    sex = ifelse.(rand(1000) .<= 0.5, :male, :female)

@code_warntype @formula(Surv(time,status)~sex) # already unstable
@code_warntype schema(df) # stable. 
@code_warntype apply_schema(@formula(Surv(time,status)~sex), schema(df)) # very unstable.

Can you help me understand what I missed ?

We discussed this a bit on slack, but wanted to follow up here. Thanks for the MWE!

Type instability isn’t necessarily a problem per se, and it’s basically unavoidable in the design that StatsModels currently uses (with the types of the output of apply_schema dependent on the values in the schema). I’ll also note that the example is still type-unstable without using Surv at all (e.g., @formula(time~sex)), so it’s not that you’re doing anything wrong here.

Where type instability can cause an issue is if it prevents the compiler from being able to generate efficient, specialized code for numerical operations. Even when there’s type instability, as long as the performance-critical stuff is behind a function boundary, the types of the arguments can be specialized on.

So, this MWE demonstrates that there’s type instability, which isn’t IMO unexpected and basically impossible to remove from StatsModels without putting a lot more type information into the schema type (e.g,. making it a NamedTuple or something like that and using generated funtions to lookup values) and requiring that all tables are also fully typed (again, using something like a NamedTuple-of-Vectors and using generated functions to look columns up). I haven’t seen this instability causing performance problems in practice, but if it is causing performance issues I’d love to see an MWE of that and we’ll see if there’s anything we can do!

1 Like