# Safe predictions using `@formula` and regression splines

Dear all,

As my first effort in Julia, I have written a `Splines2` package for spline bases for regression modelling (https://github.com/mclements/Splines2.jl). The implementation includes equivalents to the `splines::ns`, `splines::bs`, `splines2::mSpline` and `splines2::iSpline` functions from R, with names `Splines2.ns`, `Splines2.bs`, `Splines2.ms` and `Splines2.is`, respectively. These take an argument `x` (an `Array{T,1}` with `T<:Real`) together with design information, and evaluate the spline basis at `x`. I have also written functions that return functions for evaluating the splines at new `x` values, with names `Splines2.ns_`, `Splines2.bs_`, `Splines2.ms_` and `Splines2.is_`, respectively. Using the latter functions and constructing the design matrix ourselves, we can get “safe” predictions on new data:

```julia
using Splines2, GLM, Random
Random.seed!(12345);
x = collect(range(0.0, length=301, stop=2.0*pi));
y = sin.(x) + randn(length(x));
ns1 = Splines2.ns_(x, df=5, intercept=true) # this is a function
X = ns1(x);
fit1 = lm(X, y)
newx = collect(0.0:0.5:3.5);
predict(fit1, ns1(newx)) # safe predictions
```

However, I am uncertain whether `@formula` from StatsModels will allow for “safe” predictions on new data. Adapting the `poly` example from https://juliastats.org/StatsModels.jl/latest/internals/, I used:

```julia
using StatsModels
ns(x, df) = Splines2.ns(x, df=df, intercept=true) # assumes intercept

const NSPLINE_CONTEXT = Any

struct NSplineTerm{T,D} <: AbstractTerm
    term::T
    df::D
end

Base.show(io::IO, p::NSplineTerm) = print(io, "ns($(p.term), $(p.df))")

function StatsModels.apply_schema(t::FunctionTerm{typeof(ns)},
                                  sch::StatsModels.Schema,
                                  Mod::Type{<:NSPLINE_CONTEXT})
    apply_schema(NSplineTerm(t.args_parsed...), sch, Mod)
end

function StatsModels.apply_schema(t::NSplineTerm,
                                  sch::StatsModels.Schema,
                                  Mod::Type{<:NSPLINE_CONTEXT})
    term = apply_schema(t.term, sch, Mod)
    isa(term, ContinuousTerm) ||
        throw(ArgumentError("NSplineTerm only works with continuous terms (got $term)"))
    isa(t.df, ConstantTerm) ||
        throw(ArgumentError("NSplineTerm df must be a number (got $(t.df))"))
    NSplineTerm(term, t.df.n)
end

function StatsModels.modelcols(p::NSplineTerm, d::NamedTuple)
    col = modelcols(p.term, d)
    # note: the basis (and hence the knots) is re-derived from `d` here
    Splines2.ns(col, df=p.df, intercept=true)
end

StatsModels.terms(p::NSplineTerm) = terms(p.term)
StatsModels.termvars(p::NSplineTerm) = StatsModels.termvars(p.term)
StatsModels.width(p::NSplineTerm) = p.df # one column per degree of freedom
StatsModels.coefnames(p::NSplineTerm) = "ns(" .* coefnames(p.term) .* "," .* string.(1:p.df) .* ")"
```

For the above example, I then get:

```julia
d = DataFrames.DataFrame(x=x, y=y);
fit2 = lm(@formula(y ~ ns(x,5) + 0), d) # equivalent to fit1 with nicer labels
predict(fit2, ns1(newx)) # safe predictions
predict(fit2, DataFrames.DataFrame(x=newx)) # unsafe predictions!
```

The issue is that the design information (in this case, the knots) is not carried with the formula. We could provide the knots as a hint to the schema for `:x`, but `schema` is focused on the individual `Term` values in the data rather than the `AbstractTerm` values in the formula. To do this automatically, we would need to construct the hints in a manner similar to the `terms` function. The design information has already been lost by the time `apply_schema` (which does use `AbstractTerm` values) is called. Is this a design constraint for formula 2.0, or have I misinterpreted `schema.jl` and `terms.jl`?

Sincerely, Mark Clements.


I think what you want to do here is store the knots inside the `NSplineTerm` itself (instead of just the degrees of freedom). Generally, the concrete term should contain everything necessary to generate the model columns. It looks like the `ns_` function in your first example provides that kind of functionality, so why not just store its output?
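A minimal, self-contained sketch of this idea (the names `StoredBasisTerm` and `basiscols` are hypothetical stand-ins for `NSplineTerm` and `StatsModels.modelcols`, and the toy quadratic closure stands in for the closure returned by `Splines2.ns_`):

```julia
# Hypothetical sketch: the term stores the basis closure built from the
# *training* data, so evaluating on new data re-uses the original knots.
struct StoredBasisTerm{F}
    df::Int
    basis::F  # e.g. the closure returned by Splines2.ns_(xtrain, df=df, intercept=true)
end

# Analogue of StatsModels.modelcols: evaluate the stored basis on new data,
# never re-deriving design information from `x` itself.
basiscols(t::StoredBasisTerm, x::AbstractVector) = t.basis(x)

# Toy usage with a quadratic basis standing in for a spline basis:
t = StoredBasisTerm(2, x -> hcat(x, x .^ 2))
M = basiscols(t, [1.0, 2.0, 3.0])  # 3×2 matrix
```

Because the closure is created once from the training data, calling `basiscols` with new data cannot change the design information.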

One potential issue I can see is that you’re passing the full `x` vector to `ns_`, whereas only the invariants from the schema are available when you’re constructing the term. But that’s why the continuous term type stores the min, max, mean, and std. If that’s not enough information, then you might consider calculating the knots etc. lazily (e.g., have a field on the term that you initialize with a placeholder like `nothing`, and compute when `modelcols` is called and the value is `nothing`)…but that might have performance issues.
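The lazy variant might look like this (a sketch with hypothetical names; the `df - 1` interior-knots-at-quantiles convention is an assumption, not either package’s actual rule):

```julia
using Statistics

# Sketch of lazy design information: knots start as `nothing` and are computed
# from the first data vector seen, then re-used on all later calls.
mutable struct LazyKnotTerm
    df::Int
    knots::Union{Nothing,Vector{Float64}}
end
LazyKnotTerm(df::Int) = LazyKnotTerm(df, nothing)

function ensure_knots!(t::LazyKnotTerm, x::AbstractVector{<:Real})
    if t.knots === nothing
        # assumed convention: df - 1 interior knots at equally spaced quantiles
        probs = range(0.0, stop=1.0, length=t.df + 1)[2:end-1]
        t.knots = [quantile(x, p) for p in probs]
    end
    t.knots
end

lt = LazyKnotTerm(3)
k1 = ensure_knots!(lt, collect(0.0:0.1:1.0))  # computes knots on first call
k2 = ensure_knots!(lt, [100.0, 200.0])        # re-uses the stored knots
```

The second call ignores the new data for knot placement, which is exactly the “safe” behaviour wanted; the cost is the mutable field and the branch on every evaluation.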

As a more extreme possibility, I’d thought about adding an additional `apply_context` stage that happens before the schema is computed, so you could capture custom information in the schema. That would have been helpful but not necessary in the cases I’ve seen so far, but this one might be a good argument for it…

BTW, I think this package looks like a very useful addition to the Julia stats ecosystem, so thanks for bringing it up here! People have been grumbling about spline/basis regression for a while, so I’m glad to see someone has just gone and done it.

Dave,

Following your suggestion of using the `ns_` return value directly, we have the following working solution:

```julia
import Splines2, GLM, Random, DataFrames, StatsModels

# Define modelcols for a FunctionTerm of the named function, calling the
# function directly on the data columns (without broadcasting over rows)
macro FunctionTermNoBroadcast(fname)
    return :(StatsModels.modelcols(ft::StatsModels.FunctionTerm{typeof($fname),Fa,Names},
                                   d::NamedTuple) where {Fa,Names} =
                 ft.fanon(getfield.(Ref(d), Names)...))
end

Random.seed!(12345);
x = collect(range(0.0, length=301, stop=2.0*pi));
y = sin.(x) + Random.randn(length(x));
ns1 = Splines2.ns_(x, df=5, intercept=true) # this is a function
@FunctionTermNoBroadcast ns1
d = DataFrames.DataFrame(x=x, y=y);
fit2 = GLM.lm(@StatsModels.formula(y ~ ns1(x) + 0), d)
newx = collect(0.0:0.5:3.5);
GLM.predict(fit2, ns1(newx)) # safe predictions
GLM.predict(fit2, DataFrames.DataFrame(x=newx)) # safe predictions!
```

This avoids having to construct the design matrix, but does require that (i) the splines be pre-specified and (ii) the `@FunctionTermNoBroadcast` macro be called for each spline function.

However, I suggest that `apply_context` may provide a better long-term solution. Note that this is not an easy problem: R and S+ struggled with safe prediction for many years (see the `makepredictcall` function in R).

Thanks for your kind feedback on the package.

Kindly, Mark.


Ideally, that kind of macro juggling shouldn’t be necessary. Moreover, whatever syntax you put into the formula shouldn’t require knowing anything about the data ahead of time (e.g., the formula terms are meant to be self-contained data transformers that are instantiated based on the schema, not necessarily the full data).

Now that I have a second to look at this a little more closely, it seems like the information bottleneck here is that you use the quantiles of the data to figure out where to put the knots if they’re not specified manually. Are there alternative algorithms for selecting the knots that could make a reasonable fallback? Currently, without some funny business all you’d know is the extrema and the mean/variance of the data vector (https://github.com/JuliaStats/StatsModels.jl/blob/master/src/schema.jl#L179-L183) at the point when `apply_schema` is called. I opted not to include quantiles in the `ContinuousTerm` struct since we aim to provide first-class support for streaming/online data where arbitrary quantiles can’t be computed exactly.

Anyway, if you could somehow create the spline basis struct based on the information stored in `ContinuousTerm`s, then you could do that in the `apply_schema` stage and wrap that information in a custom term struct. One thing that makes this a little awkward at the moment is that the `ns_`, `bs_`, etc. functions return a closure which evaluates the spline basis at a given vector. That’s not exactly idiomatic Julia code, since you could instead rely on dispatch more (like you do with `basis` already).
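For instance, a knot-placement fallback that needs only the invariants a `ContinuousTerm` already stores might look like this (a hedged sketch: `knots_from_extrema` is a hypothetical helper, not part of either package, and the evenly spaced `df - 1` interior-knot convention is an assumption):

```julia
# Sketch: derive knots from a ContinuousTerm's stored min and max, spacing
# interior knots evenly as a quantile-free fallback that also works for
# streaming data where exact quantiles aren't available.
function knots_from_extrema(xmin::Real, xmax::Real, df::Int)
    step = (xmax - xmin) / df
    boundary = [float(xmin), float(xmax)]
    interior = [xmin + k * step for k in 1:df-1]  # assumed df - 1 interior knots
    (boundary = boundary, interior = interior)
end

k = knots_from_extrema(0.0, 2.0, 4)
```

Evenly spaced knots are a weaker default than quantile-based placement for skewed data, but they can be computed entirely inside `apply_schema` from what the schema already records.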


Dave,

Responding belatedly, there may be two issues with using the data for the design information. First, a data vector could be used twice in a formula, for example as a spline in `x1` and then in a linear interaction between `x1` and `x2`; this suggests that different design information would be required for `x1` within the same formula. Second, it may be too limiting to pre-specify all possible design information: any new design function would be constrained by these design choices.

Given these issues, how difficult would it be to design a formula interface that captures attributes both for the data and for the design functions?

Sincerely, Mark.

That’s a good point. Another idea I was toying with was to create multiple schema entries based on the context a term occurred in.

Do you have an example of the kind of syntax you have in mind here?