Safe predictions using `@formula` and regression splines

Dear all,

As my first effort in Julia, I have written a Splines2 package providing spline bases for regression modelling (https://github.com/mclements/Splines2.jl). The implementation includes equivalents of the splines::ns, splines::bs, splines2::mSpline and splines2::iSpline functions from R, named Splines2.ns, Splines2.bs, Splines2.ms and Splines2.is, respectively. These take an argument x (an Array{T,1} where T<:Real) together with design information, and evaluate the spline basis at x. I have also written functions that return functions for evaluating the splines at new x values, named Splines2.ns_, Splines2.bs_, Splines2.ms_ and Splines2.is_, respectively. Using the latter functions and constructing the design matrix ourselves, we can get “safe” predictions on new data:

    using Splines2, GLM, Random
    Random.seed!(12345);
    x = collect(range(0.0, length=301, stop=2.0*pi));
    y = sin.(x)+randn(length(x)); 
    ns1 = Splines2.ns_(x,df=5,intercept=true) # this is a function
    X = ns1(x);
    fit1 = lm(X,y)
    newx = collect(0.0:0.5:3.5);
    predict(fit1, ns1(newx)) # safe predictions

However, I am uncertain whether @formula from StatsModels will allow for “safe” predictions on new data. Adapting the poly example from https://juliastats.org/StatsModels.jl/latest/internals/, I used:

    using StatsModels
    ns(x,df) = Splines2.ns(x,df=df,intercept=true) # assumes intercept
    const NSPLINE_CONTEXT = Any
    struct NSplineTerm{T,D} <: AbstractTerm
        term::T
        df::D
    end
    Base.show(io::IO, p::NSplineTerm) = print(io, "ns($(p.term), $(p.df))")
    function StatsModels.apply_schema(t::FunctionTerm{typeof(ns)},
                                      sch::StatsModels.Schema,
                                      Mod::Type{<:NSPLINE_CONTEXT})
        apply_schema(NSplineTerm(t.args_parsed...), sch, Mod)
    end
    function StatsModels.apply_schema(t::NSplineTerm,
                                      sch::StatsModels.Schema,
                                      Mod::Type{<:NSPLINE_CONTEXT})
        term = apply_schema(t.term, sch, Mod)
        isa(term, ContinuousTerm) ||
            throw(ArgumentError("NSplineTerm only works with continuous terms (got $term)"))
        isa(t.df, ConstantTerm) ||
            throw(ArgumentError("NSplineTerm df must be a number (got $(t.df))"))
        NSplineTerm(term, t.df.n)
    end
    function StatsModels.modelcols(p::NSplineTerm, d::NamedTuple)
        col = modelcols(p.term, d)
        Splines2.ns(col, df=p.df, intercept=true)
    end
    StatsModels.terms(p::NSplineTerm) = terms(p.term)
    StatsModels.termvars(p::NSplineTerm) = StatsModels.termvars(p.term)
    StatsModels.width(p::NSplineTerm) = p.df # the term produces df columns
    StatsModels.coefnames(p::NSplineTerm) = "ns(" .* coefnames(p.term) .* "," .* string.(1:p.df) .* ")"

For the above example, I then get:

    d = DataFrames.DataFrame(x=x,y=y);
    fit2 = lm(@formula(y~ns(x,5)+0),d) # equivalent to fit1 with nicer labels
    predict(fit2, ns1(newx)) # safe predictions
    predict(fit2, DataFrames.DataFrame(x=newx)) # unsafe predictions!

The issue is that the design information (in this case, the knots) is not carried with the formula. We could provide the knots as a hint to the schema for :x, but schema focuses on the individual Term values in the data rather than on the AbstractTerm values in the formula. To do this automatically, we would need to construct the hints in a manner similar to the terms function. By the time apply_schema (which does use AbstractTerm values) is called, the design information has already been lost. Is this a design constraint for formula 2.0, or have I misinterpreted schema.jl and terms.jl?
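
To make the failure mode concrete, here is a dependency-free sketch (interior_knots is a hypothetical helper mimicking quantile-based knot placement, not Splines2’s actual API): if the knots are recomputed from whatever data modelcols receives, new data yields a different basis than the training data did.

```julia
using Statistics

# hypothetical helper: place interior knots at quantiles of the data
interior_knots(x, df) = quantile(x, range(0, stop=1, length=df + 1)[2:(end - 1)])

x_train = collect(range(0.0, stop=2pi, length=301))
x_new   = collect(0.0:0.5:3.5)

# recomputing the knots from x_new gives a different basis, hence
# different (unsafe) predictions
interior_knots(x_train, 5) == interior_knots(x_new, 5)  # false
```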

Sincerely, Mark Clements.


I think what you want to do here is store the knots inside the NSplineTerm itself (instead of just the degrees of freedom). Generally, the concrete term should contain everything necessary to generate the model columns. It looks like the ns_ function in your first example provides that kind of functionality, so why not just store the output of that?
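
A minimal, dependency-free sketch of that suggestion (PrefitSplineTerm and modelcols_sketch are illustrative stand-ins, not StatsModels’ actual types): the term stores the closure returned by ns_, so column generation always reuses the training knots.

```julia
# stand-in term type holding a prebuilt basis function (e.g. the closure
# returned by Splines2.ns_); names here are illustrative only
struct PrefitSplineTerm{F}
    sym::Symbol   # name of the data column the basis applies to
    basis::F      # evaluates the spline basis at new x, with fixed knots
end

# column generation just applies the stored basis to the named column,
# so the training-time design information is always reused
modelcols_sketch(t::PrefitSplineTerm, d::NamedTuple) = t.basis(getfield(d, t.sym))

# toy basis standing in for the function returned by Splines2.ns_
toy_basis = x -> [xi^j for xi in x, j in 1:3]
t = PrefitSplineTerm(:x, toy_basis)
modelcols_sketch(t, (x = [1.0, 2.0],))  # 2×3 matrix from the stored basis
```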

One potential issue I can see is that you’re passing the full x vector to ns_, whereas only the invariants from the schema are available when you’re constructing the term. But that’s why the continuous term type stores the min, max, mean, and std. If that’s not enough information, then you might consider calculating the knots etc. lazily (e.g., have a field on the term that you initialize with some placeholder like nothing, and compute it the first time modelcols is called and the value is still nothing…but that might have performance issues).
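
The lazy variant might look like this sketch (plain Julia with hypothetical names; a real term would also need the usual StatsModels methods): the knots field starts as nothing and is filled from the first data vector seen, then reused thereafter.

```julia
using Statistics

# hypothetical lazily-initialized term: knots start as `nothing` and
# are computed once, from the first data vector seen, then reused
mutable struct LazyKnotsTerm
    df::Int
    knots::Union{Nothing,Vector{Float64}}
end
LazyKnotsTerm(df::Int) = LazyKnotsTerm(df, nothing)

function knots!(t::LazyKnotsTerm, x::AbstractVector)
    if t.knots === nothing
        # quantile-based placement, computed only on the first call
        t.knots = quantile(x, range(0, stop=1, length=t.df + 1)[2:(end - 1)])
    end
    return t.knots
end

t = LazyKnotsTerm(5)
k1 = knots!(t, collect(range(0.0, stop=2pi, length=301)))  # initializes
k2 = knots!(t, collect(0.0:0.5:3.5))                       # reuses, no recompute
k1 === k2  # true
```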

As a more extreme possibility, I’d thought about adding an additional apply_context stage that happens before the schema, so that you could capture custom information in the schema. That would have been helpful but not strictly necessary for the cases I’ve seen so far, but this one might be a good argument for it…

BTW, I think this package looks like a very useful addition to the Julia stats ecosystem, so thanks for bringing it up here! People have been grumbling about spline/basis regression for a while, so I’m glad to see someone has just gone and done it :)


Dave,

Following your suggestion of using the ns_ return value directly, we have the following working solution:

import Splines2, GLM, Random, DataFrames, StatsModels
macro FunctionTermNoBroadcast(fname)
    return :(StatsModels.modelcols(ft::StatsModels.FunctionTerm{typeof($fname),Fa,Names},
                                   d::NamedTuple) where {Fa,Names} =
             ft.fanon(getfield.(Ref(d), Names)...))
end
Random.seed!(12345);
x = collect(range(0.0, length=301, stop=2.0*pi));
y = sin.(x)+Random.randn(length(x)); 
ns1 = Splines2.ns_(x,df=5,intercept=true) # this is a function
@FunctionTermNoBroadcast(ns1)
d = DataFrames.DataFrame(x=x,y=y);
fit2 = GLM.lm(@StatsModels.formula(y~ns1(x)+0),d)
newx = collect(0.0:0.5:3.5);
GLM.predict(fit2, ns1(newx)) # safe predictions
GLM.predict(fit2, DataFrames.DataFrame(x=newx)) # safe predictions!

This avoids having to construct the design matrix, but it does require (i) that the splines be pre-specified and (ii) that @FunctionTermNoBroadcast be called for each spline function.

However, I suggest that apply_context may provide a better long-term solution. Note that this is not an easy problem: R and S+ struggled with safe prediction for many years (see the makepredictcall function in R).

Thanks for your kind feedback on the package.

Kindly, Mark.


Ideally that kind of macro juggling shouldn’t be necessary. Moreover, whatever syntax you put into the formula shouldn’t require knowing anything about the data ahead of time (e.g., the formula terms are meant to be self-contained data transformers that are instantiated based on the schema, not necessarily the full data).

Now that I have a second to look at this a little more closely, it seems like the information bottleneck here is that you use the quantiles of the data to figure out where to put the knots if they’re not specified manually. Are there alternative algorithms for selecting the knots that could make a reasonable fallback? Currently, without some funny business all you’d know is the extrema and the mean/variance of the data vector (StatsModels.jl/schema.jl at master · JuliaStats/StatsModels.jl · GitHub) at the point when apply_schema is called. I opted not to include quantiles in the ContinuousTerm struct since we aim to provide first-class support for streaming/online data where arbitrary quantiles can’t be computed exactly.

Anyway, if you could somehow create the spline basis struct based on the information stored in ContinuousTerms, then you could do that in the apply_schema stage and wrap that information in a custom term struct. One thing that makes this a little awkward at the moment is that the ns_, bs_, etc. functions return a closure which evaluates the spline basis at a given vector. That’s not exactly idiomatic Julia code, since you could instead rely on dispatch more (like you do with basis already).
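
One dispatch-friendly shape would be a callable basis struct rather than a closure. This is a hedged sketch with a toy truncated-power basis standing in for the real natural-spline computation; the names are illustrative, not Splines2’s API.

```julia
# callable basis struct: the design information lives in fields, and
# evaluating the basis is an ordinary method, so other generic functions
# (width, coefnames, ...) could dispatch on the type as well
struct TruncatedPowerBasis   # toy stand-in for a natural-spline basis type
    knots::Vector{Float64}
end

# evaluating the basis at new x reuses the stored knots by construction
(b::TruncatedPowerBasis)(x::AbstractVector) =
    [max(xi - k, 0.0)^3 for xi in x, k in b.knots]

b = TruncatedPowerBasis([1.0, 2.0])
b([0.5, 3.0])  # 2×2 matrix evaluated with the stored knots
```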


Dave,

Responding belatedly: there may be two issues with using the data for the design information. First, a data vector could be used twice in a formula, such as a spline for x1 together with a linear interaction between x1 and x2; different design information would then be required for x1 within the same formula. Second, it may be too limiting to pre-specify all possible design information: any new design function would be constrained by those choices.

Given these issues: how difficult would it be to design a formula to capture the attributes for the data and for the design functions?

Sincerely, Mark.

That’s a good point. Another idea I was toying with was to create multiple schema entries based on the context a term occurred in.

Do you have an example of the kind of syntax you have in mind here?