I’m building a parametric Bayesian model for which I need a design matrix with many columns (one observation per row). The columns represent the original measured variables (continuous and categorical), spline basis functions of those variables, and interaction terms between variables.
I’ve been trying to leverage StatsModels to do this by expressing my model as a sum of Term objects, and then combining that formula with a schema and the data to create the design matrix.
I have a function to do this that, once compiled, runs very quickly. The problem is that compilation accounts for 99.9% of the total run time the first time the program runs, and I find myself needing to recompile whenever the model changes at all.
Here is an example:
using BenchmarkTools, StatsModels

# example
function buildmat(NPREDICTORS=10)
    # create some test data with a set of continuous predictors and one multi-category predictor;
    # this will be a short data set (few observations) but potentially very wide (many predictors)
    X = rand(6, NPREDICTORS) # test exposure data
    # named tuple with the outcome and the categorical predictor
    nt = (y = [0, 1, 0, 1, 0, 1], z = [0, 1, 2, 3, 4, 5])
    # add the continuous predictors x1, ..., xNPREDICTORS
    dd = Dict()
    for i in 1:size(X, 2)
        dd[Symbol("x$i")] = X[:, i]
    end
    dat = merge(nt, (; dd...))
    # create the model formula programmatically with StatsModels tools
    # f(y) = 1 + z + X + z*X
    mf = Term(:y) ~ ConstantTerm(1) +
        Term(:z) +
        sum(term(Symbol("x$i")) for i in 1:NPREDICTORS) +
        sum(InteractionTerm(term.((:z, Symbol("x$i")))) for i in 1:NPREDICTORS)
    # create schema hints (categorical variable coding or continuous term notes)
    myhints = Dict{Symbol,Any}(
        :z => DummyCoding(base=0, levels=[0, 1, 2, 3, 4, 5]),
        :y => DummyCoding(base=0, levels=[0, 1]),
    )
    for trm in [Symbol("x$i") for i in 1:size(X, 2)]
        if !haskey(myhints, trm)
            myhints[trm] = ContinuousTerm
        end
    end
    # build the design matrix testX
    myschema = schema(dat, myhints)
    f = apply_schema(mf, myschema)
    testX = hcat(modelcols(f, dat)...)
end
The problem is that the compiler is triggered every time I change anything about the model. Here I’m changing the underlying data (the number of predictors), so every call triggers compilation again. As you can see, this leads to a painful workflow where I have to wait 3-5 minutes just to build a fairly small matrix (a 6x606 matrix takes 170 seconds!), while a bespoke solution could rip through this. I like using the StatsModels interface because it simplifies the coding here, but it carries a lot of unexpected overhead.
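If I understand the cause correctly (and I may not), the formula object’s type encodes every term it contains, so a formula with a different number of predictors is a different type, and the schema/modelcols machinery gets compiled fresh for each one. A quick illustrative check, separate from the build code above:

mf10 = Term(:y) ~ ConstantTerm(1) + sum(term(Symbol("x$i")) for i in 1:10)
mf11 = Term(:y) ~ ConstantTerm(1) + sum(term(Symbol("x$i")) for i in 1:11)
typeof(mf10) == typeof(mf11) # false: each formula width is a distinct type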
# first-time run (compilation time is important to my workflow)
@time buildmat(10)
# 2.024731 seconds (7.50 M allocations: 413.870 MiB, 2.38% gc time, 99.71% compilation time)
@time buildmat(50)
# 19.502389 seconds (6.72 M allocations: 386.869 MiB, 0.21% gc time, 99.95% compilation time)
@time buildmat(100)
# 167.304433 seconds (9.58 M allocations: 650.644 MiB, 0.04% gc time, 99.98% compilation time)
This is quick once the function is compiled, but that really doesn’t help me much here.
@btime buildmat(10); # n = (6*11) : 6x66 Matrix
# 68.000 μs (1901 allocations: 315.00 KiB)
@btime buildmat(50); # n = (6*51) : 6x306 Matrix
# 1.841 ms (25455 allocations: 17.95 MiB)
@btime buildmat(100); # n = (6*101) : 6x606 Matrix
# 18.540 ms (130093 allocations: 129.93 MiB)
Is there something I can do differently that will cut the compilation time or allow me to compile on a small problem and then apply the fast, compiled function to a larger problem?
I should also mention that in my actual workflow, I separate out the data creation from the model formula/design matrix creation, and the compilation problems remain.
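For completeness, the separated version looks roughly like the sketch below (the function names are placeholders, not my real code); the first call for a new formula shape still pays the full compilation cost:

# rough sketch of the separated workflow (hypothetical helper names)
function make_formula(npred)
    Term(:y) ~ ConstantTerm(1) + Term(:z) +
        sum(term(Symbol("x$i")) for i in 1:npred) +
        sum(InteractionTerm(term.((:z, Symbol("x$i")))) for i in 1:npred)
end

function make_design(mf, dat, hints)
    # the schema/apply_schema/modelcols step is where the per-formula compilation happens
    f = apply_schema(mf, schema(dat, hints))
    hcat(modelcols(f, dat)...)
end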