Dear all,
As my first effort in Julia, I have written a Splines2 package for spline bases for regression modelling (https://github.com/mclements/Splines2.jl). The implementation includes equivalents to the splines::ns, splines::bs, splines2::mSpline and splines2::iSpline functions from R, with names Splines2.ns, Splines2.bs, Splines2.ms and Splines2.is, respectively. These take an argument x=Array{T<:Real,1} and design information and evaluate the spline basis at x. I have also written functions that return functions for evaluating the splines at new x values, with names Splines2.ns_, Splines2.bs_, Splines2.ms_ and Splines2.is_, respectively. Using the latter functions and constructing the design matrix ourselves, we can get “safe” predictions on new data:
using Splines2, GLM, Random
Random.seed!(12345);
x = collect(range(0.0, length=301, stop=2.0*pi));
y = sin.(x)+randn(length(x));
ns1 = Splines2.ns_(x,df=5,intercept=true) # this is a function
X = ns1(x);
fit1 = lm(X,y)
newx = collect(0.0:0.5:3.5);
predict(fit1, ns1(newx)) # safe predictions
However, I am uncertain whether @formula from StatsModels will allow for “safe” predictions on new data. Adapting the poly example from https://juliastats.org/StatsModels.jl/latest/internals/, I used:
using StatsModels
ns(x,df) = Splines2.ns(x,df=df,intercept=true) # assumes intercept
const NSPLINE_CONTEXT = Any
struct NSplineTerm{T,D} <: AbstractTerm
term::T
df::D
end
Base.show(io::IO, p::NSplineTerm) = print(io, "ns($(p.term), $(p.df))")
function StatsModels.apply_schema(t::FunctionTerm{typeof(ns)},
sch::StatsModels.Schema,
Mod::Type{<:NSPLINE_CONTEXT})
apply_schema(NSplineTerm(t.args_parsed...), sch, Mod)
end
function StatsModels.apply_schema(t::NSplineTerm,
sch::StatsModels.Schema,
Mod::Type{<:NSPLINE_CONTEXT})
term = apply_schema(t.term, sch, Mod)
isa(term, ContinuousTerm) ||
throw(ArgumentError("NSplineTerm only works with continuous terms (got $term)"))
isa(t.df, ConstantTerm) ||
throw(ArgumentError("NSplineTerm df must be a number (got $(t.df))"))
NSplineTerm(term, t.df.n)
end
function StatsModels.modelcols(p::NSplineTerm, d::NamedTuple)
col = modelcols(p.term, d)
Splines2.ns(col, df=p.df,intercept=true)
end
StatsModels.terms(p::NSplineTerm) = terms(p.term)
StatsModels.termvars(p::NSplineTerm) = StatsModels.termvars(p.term)
StatsModels.width(p::NSplineTerm) = 1
StatsModels.coefnames(p::NSplineTerm) = "ns(" .* coefnames(p.term) .* "," .* string.(1:p.df) .* ")"
For the above example, I then get:
d = DataFrames.DataFrame(x=x,y=y);
fit2 = lm(@formula(y~ns(x,5)+0),d) # equivalent to fit1 with nicer labels
predict(fit2, ns1(newx)) # safe predictions
predict(fit2, DataFrames.DataFrame(x=newx)) # unsafe predictions!
The issue relates to the design information (in this case, the knots) not being carried with the formula. We could provide the knots information as a hint to the schema for :x, but schema is focused on the individual Term values in the data, rather than the AbstractTerm values in the formula. To do this automatically, we would need to construct the hints in a manner similar to the terms function. The design information has been lost when apply_schema (which does use AbstractTerm values) is called. Is this a design constraint for formula 2.0 – or have I misinterpreted schema.jl and terms.jl?
Sincerely, Mark Clements.