# Safe predictions using `@formula` and regression splines

Dear all,

As my first effort in Julia, I have written a `Splines2` package for spline bases for regression modelling (https://github.com/mclements/Splines2.jl). The implementation includes equivalents to the `splines::ns`, `splines::bs`, `splines2::mSpline` and `splines2::iSpline` functions from R, with names `Splines2.ns`, `Splines2.bs`, `Splines2.ms` and `Splines2.is`, respectively. These take an argument `x=Array{T<:Real,1}` and design information and evaluate the spline basis at `x`. I have also written functions that return functions for evaluating the splines at new `x` values, with names `Splines2.ns_`, `Splines2.bs_`, `Splines2.ms_` and `Splines2.is_`, respectively. Using the latter functions and constructing the design matrix ourselves, we can get "safe" predictions on new data:

```julia
using Splines2, GLM, Random
Random.seed!(12345);
x = collect(range(0.0, length=301, stop=2.0*pi));
y = sin.(x) + randn(length(x));
ns1 = Splines2.ns_(x, df=5, intercept=true) # this is a function
X = ns1(x);
fit1 = lm(X, y)
newx = collect(0.0:0.5:3.5);
predict(fit1, ns1(newx)) # safe predictions
```

However, I am uncertain whether `@formula` from StatsModels will allow for "safe" predictions on new data. Adapting the `poly` example from https://juliastats.org/StatsModels.jl/latest/internals/, I used:

```julia
using StatsModels
ns(x, df) = Splines2.ns(x, df=df, intercept=true) # assumes intercept
const NSPLINE_CONTEXT = Any
struct NSplineTerm{T,D} <: AbstractTerm
    term::T
    df::D
end
Base.show(io::IO, p::NSplineTerm) = print(io, "ns($(p.term), $(p.df))")
function StatsModels.apply_schema(t::FunctionTerm{typeof(ns)},
                                  sch::StatsModels.Schema,
                                  Mod::Type{<:NSPLINE_CONTEXT})
    apply_schema(NSplineTerm(t.args_parsed...), sch, Mod)
end
function StatsModels.apply_schema(t::NSplineTerm,
                                  sch::StatsModels.Schema,
                                  Mod::Type{<:NSPLINE_CONTEXT})
    term = apply_schema(t.term, sch, Mod)
    isa(term, ContinuousTerm) ||
        throw(ArgumentError("NSplineTerm only works with continuous terms (got $term)"))
    isa(t.df, ConstantTerm) ||
        throw(ArgumentError("NSplineTerm df must be a number (got $(t.df))"))
    NSplineTerm(term, t.df.n)
end
function StatsModels.modelcols(p::NSplineTerm, d::NamedTuple)
    col = modelcols(p.term, d)
    Splines2.ns(col, df=p.df, intercept=true)
end
StatsModels.terms(p::NSplineTerm) = terms(p.term)
StatsModels.termvars(p::NSplineTerm) = StatsModels.termvars(p.term)
StatsModels.width(p::NSplineTerm) = 1
StatsModels.coefnames(p::NSplineTerm) = "ns(" .* coefnames(p.term) .* "," .* string.(1:p.df) .* ")"
```

For the above example, I then get:

```julia
import DataFrames
d = DataFrames.DataFrame(x=x, y=y);
fit2 = lm(@formula(y ~ ns(x, 5) + 0), d) # equivalent to fit1 with nicer labels
predict(fit2, ns1(newx)) # safe predictions
predict(fit2, DataFrames.DataFrame(x=newx)) # unsafe predictions!
```

The issue is that the design information (in this case, the knots) is not carried with the formula. We could provide the knots as a hint to the schema for `:x`, but `schema` is focused on the individual `Term` values in the data rather than the `AbstractTerm` values in the formula; to do this automatically, we would need to construct the hints in a manner similar to the `terms` function. By the time `apply_schema` (which does use `AbstractTerm` values) is called, the design information has already been lost. Is this a design constraint for formula 2.0, or have I misinterpreted `schema.jl` and `terms.jl`?

Sincerely, Mark Clements.


I think what you want to do here is store the knots inside the `NSplineTerm` itself (instead of just the degrees of freedom). Generally, the concrete term should contain everything necessary to generate the model columns. It looks like the `ns_` function in your first example provides that kind of functionality, so why not just store the output of that?

One potential issue I can see is that you're passing the full `x` vector to `ns_`, whereas only the invariants from the schema are available when you're constructing the term. But that's why the continuous term type stores the min, max, mean, and std. If that's not enough information, then you might consider calculating the knots etc. lazily (e.g., have a field on the term that you initialize with some placeholder like `nothing`, and compute it when `modelcols` is called and the value is still `nothing`; but that might have performance issues).
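The lazy approach could be sketched in plain Julia (independent of StatsModels and Splines2) as follows; `LazySplineTerm` and `materialize!` are hypothetical names, not part of either package:

```julia
using Statistics  # for quantile

# Store a placeholder (`nothing`) for the design information and compute it
# from the first data the term actually sees; later calls reuse the cache.
mutable struct LazySplineTerm
    df::Int
    knots::Union{Nothing,Vector{Float64}}  # placeholder until first use
end
LazySplineTerm(df::Int) = LazySplineTerm(df, nothing)

function materialize!(t::LazySplineTerm, x::AbstractVector{<:Real})
    if t.knots === nothing
        # interior knots at equally spaced quantiles of the first data seen
        probs = range(0, 1, length=t.df + 1)[2:(end - 1)]
        t.knots = [quantile(x, p) for p in probs]
    end
    return t.knots
end
```

Calling `materialize!` a second time with different data returns the cached knots, so predictions on new data would use the design information from the fitting data.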

As a more extreme possibility, I'd thought about adding an additional `apply_context` stage that happens before the schema, so you could capture custom information in the schema. That would have been helpful but not necessary for the things I've seen so far, but this case might be a good argument for it…

BTW, I think this package looks like a very useful addition to the Julia stats ecosystem, so thanks for bringing it up here! People have been grumbling about spline/basis regression for a while, so I'm glad to see someone has just gone and done it.


Dave,

Following your suggestion of using the `ns_` return value directly, we have the following working solution:

```julia
import Splines2, GLM, Random, DataFrames, StatsModels
# Define a non-broadcasting modelcols method for a given function, so that
# the function's return value (the spline design matrix) is used as-is
macro FunctionTermNoBroadcast(fname)
    return :(StatsModels.modelcols(ft::StatsModels.FunctionTerm{typeof($fname),Fa,Names},
                                   d::NamedTuple) where {Fa,Names} =
                 ft.fanon(getfield.(Ref(d), Names)...))
end
Random.seed!(12345);
x = collect(range(0.0, length=301, stop=2.0*pi));
y = sin.(x) + Random.randn(length(x));
ns1 = Splines2.ns_(x, df=5, intercept=true) # this is a function
@FunctionTermNoBroadcast ns1
d = DataFrames.DataFrame(x=x, y=y);
fit2 = GLM.lm(@StatsModels.formula(y ~ ns1(x) + 0), d)
newx = collect(0.0:0.5:3.5);
GLM.predict(fit2, ns1(newx)) # safe predictions
GLM.predict(fit2, DataFrames.DataFrame(x=newx)) # safe predictions!
```

This avoids having to construct the design matrix ourselves, but does require (i) that the splines be pre-specified and (ii) that `@FunctionTermNoBroadcast` be called for each spline function.

However, I suggest that `apply_context` may provide a better long-term solution. Note that this is not an easy problem: R and S+ struggled with safe prediction for many years (see the `makepredictcall` function in R).

Thanks for your kind feedback on the package.

Kindly, Mark.


Ideally that kind of macro juggling shouldn't be necessary. Moreover, whatever syntax you put into the formula shouldn't require knowing anything about the data ahead of time (e.g. the formula terms are meant to be self-contained data transformers that are instantiated based on the schema, not necessarily the full data).

Now that I have a second to look at this a little more closely, it seems like the information bottleneck here is that you use the quantiles of the data to figure out where to put the knots if they're not specified manually. Are there alternative algorithms for selecting the knots that could make a reasonable fallback? Currently, without some funny business, all you'd know at the point when `apply_schema` is called is the extrema and the mean/variance of the data vector (https://github.com/JuliaStats/StatsModels.jl/blob/master/src/schema.jl#L179-L183). I opted not to include quantiles in the `ContinuousTerm` struct since we aim to provide first-class support for streaming/online data, where arbitrary quantiles can't be computed exactly.
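One such fallback, sketched in plain Julia, would place the interior knots evenly between the minimum and maximum (which the schema does retain as invariants, unlike quantiles); `even_knots` is a hypothetical helper, not part of Splines2 or StatsModels:

```julia
# Evenly spaced interior knots between the extrema: a quantile-free
# fallback usable when only the schema invariants are available.
even_knots(lo::Real, hi::Real, df::Int) =
    collect(range(lo, stop=hi, length=df + 1))[2:(end - 1)]
```

For example, `even_knots(0.0, 10.0, 5)` yields the four interior knots `[2.0, 4.0, 6.0, 8.0]`. For skewed data this would put knots in sparser regions than quantile-based placement, which is the trade-off for not needing the full data vector.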

Anyway, if you could somehow create the spline basis struct from the information stored in `ContinuousTerm`s, then you could do that in the `apply_schema` stage and wrap that information in a custom term struct. One thing that makes this a little awkward at the moment is that the `ns_`, `bs_`, etc. functions return a closure which evaluates the spline basis at a given vector. That's not exactly idiomatic Julia code, since you could instead rely more on dispatch (like you do with `basis` already).
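The dispatch-based alternative might look like the sketch below: a struct carrying the design information that is itself callable (a functor) rather than an anonymous closure. A toy truncated-power basis stands in for the real natural-spline basis, and `TruncatedPowerBasis` is a hypothetical name:

```julia
# The struct carries the design information (here, just the knots) ...
struct TruncatedPowerBasis
    knots::Vector{Float64}
end

# ... and calling it evaluates the basis at new x values, one column per
# knot, via the truncated cubic max(x - k, 0)^3
(b::TruncatedPowerBasis)(x::AbstractVector{<:Real}) =
    [max(xi - k, 0.0)^3 for xi in x, k in b.knots]
```

Unlike a closure, such a struct can be inspected, serialized, and targeted by further method definitions (e.g. `show`, `coefnames`), which makes it a better fit for wrapping inside a custom term.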


Dave,

Responding belatedly: there may be two issues with using the data for the design information. First, a data vector could be used twice in a formula, such as a spline for `x1` together with a linear interaction between `x1` and `x2`; this suggests that different design information could be required for `x1` within the same formula. Second, it may be too limiting to pre-specify all possible design information, since any new design function would be constrained by those design choices.

Given these issues: how difficult would it be to design a formula to capture the attributes for the data and for the design functions?

Sincerely, Mark.

That's a good point. Another idea I was toying with was to create multiple schema entries based on the context a term occurred in.

Do you have an example of the kind of syntax you have in mind here?