Dear all,
As my first effort in Julia, I have written a Splines2
package for spline bases for regression modelling (https://github.com/mclements/Splines2.jl). The implementation includes equivalents to the splines::ns
, splines::bs
, splines2::mSpline
and splines2::iSpline
functions from R, with names Splines2.ns
, Splines2.bs
, Splines2.ms
and Splines2.is
, respectively. These take an argument x=Array{T<:Real,1}
and design information and evaluate the spline basis at x
. I have also written functions that return functions for evaluating the splines at new x
values, with names Splines2.ns_
, Splines2.bs_
, Splines2.ms_
and Splines2.is_
, respectively. Using the latter functions and constructing the design matrix ourselves, we can get “safe” predictions on new data:
using Splines2, GLM, Random
Random.seed!(12345);
x = collect(range(0.0, length=301, stop=2.0*pi));
y = sin.(x)+randn(length(x));
ns1 = Splines2.ns_(x,df=5,intercept=true) # this is a function
X = ns1(x);
fit1 = lm(X,y)
newx = collect(0.0:0.5:3.5);
predict(fit1, ns1(newx)) # safe predictions
However, I am uncertain whether @formula
from StatsModels will allow for “safe” predictions on new data. Adapting the poly
example from https://juliastats.org/StatsModels.jl/latest/internals/, I used:
using StatsModels
ns(x,df) = Splines2.ns(x,df=df,intercept=true) # assumes intercept
const NSPLINE_CONTEXT = Any
struct NSplineTerm{T,D} <: AbstractTerm
term::T
df::D
end
Base.show(io::IO, p::NSplineTerm) = print(io, "ns($(p.term), $(p.df))")
function StatsModels.apply_schema(t::FunctionTerm{typeof(ns)},
sch::StatsModels.Schema,
Mod::Type{<:NSPLINE_CONTEXT})
apply_schema(NSplineTerm(t.args_parsed...), sch, Mod)
end
function StatsModels.apply_schema(t::NSplineTerm,
sch::StatsModels.Schema,
Mod::Type{<:NSPLINE_CONTEXT})
term = apply_schema(t.term, sch, Mod)
isa(term, ContinuousTerm) ||
throw(ArgumentError("NSplineTerm only works with continuous terms (got $term)"))
isa(t.df, ConstantTerm) ||
throw(ArgumentError("NSplineTerm df must be a number (got $(t.df))"))
NSplineTerm(term, t.df.n)
end
function StatsModels.modelcols(p::NSplineTerm, d::NamedTuple)
col = modelcols(p.term, d)
Splines2.ns(col, df=p.df,intercept=true)
end
StatsModels.terms(p::NSplineTerm) = terms(p.term)
StatsModels.termvars(p::NSplineTerm) = StatsModels.termvars(p.term)
StatsModels.width(p::NSplineTerm) = 1
StatsModels.coefnames(p::NSplineTerm) = "ns(" .* coefnames(p.term) .* "," .* string.(1:p.df) .* ")"
For the above example, I then get:
d = DataFrames.DataFrame(x=x,y=y);
fit2 = lm(@formula(y~ns(x,5)+0),d) # equivalent to fit1 with nicer labels
predict(fit2, ns1(newx)) # safe predictions
predict(fit2, DataFrames.DataFrame(x=newx)) # unsafe predictions!
The issue relates to the design information (in this case, the knots) not being carried with the formula. We could provide the knots information as a hint to the schema for :x
, but schema
is focused on the individual Term
values in the data, rather than the AbstractTerm
values in the formula. To do this automatically, we would need to construct the hints in a manner similar to the terms
function. The design information has been lost when apply_schema
(which does use AbstractTerm
values) is called. Is this a design constraint for formula 2.0 – or have I misinterpreted schema.jl
and terms.jl
?
Sincerely, Mark Clements.