Can you help me find a speedier way to make a design matrix for high parameter model?

I’m building a parametric Bayesian model where I require a design matrix with lots of columns (1 observation per row). The columns represent original measured variables (continuous and categorical) as well as spline basis functions of the original variables and interaction terms between variables.

I’ve been trying to leverage StatsModels to do this by expressing my model as a sum of Terms objects, and then using that model with schema and data to create the design matrix.

I have a function to do this that, once compiled, does this process very quickly. The problem is that compilation time is 99.9% of the total run time for the first time running this program, and I find myself often needing to do compilation if the model changes at all.

Here is an example:

using BenchmarkTools, StatsModels

#  example

function buildmat(NPREDICTORS=10)
   # create some test data with a set of continuous predictors and one multi-category predictor
   # this will be a short data set (few observations) but potentially very wide (many predictors)
   X = rand(6, NPREDICTORS)    #test exposure data
   # named tuple with data
   nt = (y = [0,1,0,1,0,1], z = [0,1,2,3,4,5])
   dd = Dict()
   for i in 1:size(X,2)
     dd[Symbol("x$i")] = X[:,i]
   end
   dat = merge(nt, (; dd...))
   
   
   # create model formula programmatically from StatsModels tools
   #f(y) = 1 + z + X + z*X
   mf = Term(:y) ~ ConstantTerm(1) + 
                   Term(:z) + 
                   sum(term.((Symbol("x$i")) for i in 1:NPREDICTORS)) + 
                   sum(InteractionTerm(term.((:z, Symbol("x$i")))) for i in 1:NPREDICTORS)
   # create schema hints (categorical variable coding or continuous term notes)
   myhints = Dict{Symbol, Any}(
       :z => DummyCoding(base=0, levels=[0,1,2,3,4,5]),
       :y => DummyCoding(base=0, levels=[0,1])
       )
   for trm in [Symbol("x$i") for i in 1:size(X,2)]
     if !haskey(myhints, trm )
       myhints[trm] = ContinuousTerm 
     end
   end
   # create design matrix testX
   myschema = schema(dat, myhints)
   f      = apply_schema(mf, myschema)
   testX     = hcat(modelcols(f, dat)...)
end

The problem is that the compiler is triggered every time I change anything about the model. Here I’m changing the underlying data, which makes the function call trigger the compiler each time. As you can see, this leads to a painful workflow where I have to wait 3-5 minutes just for a fairly small matrix to be made (a 6x606 matrix takes 170 seconds!), while a bespoke solution could rip through this. I like using the StatsModels interface because it simplifies the coding here, but it has a lot of unexpected overhead.

# first time run (compilation time is important to my work flow)
@time buildmat(10)
#  2.024731 seconds (7.50 M allocations: 413.870 MiB, 2.38% gc time, 99.71% compilation time)
@time buildmat(50)
# 19.502389 seconds (6.72 M allocations: 386.869 MiB, 0.21% gc time, 99.95% compilation time)
@time buildmat(100)
# 167.304433 seconds (9.58 M allocations: 650.644 MiB, 0.04% gc time, 99.98% compilation time)

This is quick once the function is compiled, but that really doesn’t help me much here.

@btime buildmat(10); # n = (6*11) : 6x66 Matrix
#  68.000 μs (1901 allocations: 315.00 KiB)

@btime buildmat(50); # n = (6*51) : 6x306  Matrix
#  1.841 ms (25455 allocations: 17.95 MiB)

@btime buildmat(100); # n = (6*101) : 6x606  Matrix
#  18.540 ms (130093 allocations: 129.93 MiB)

Is there something I can do differently that will cut the compilation time or allow me to compile on a small problem and then apply the fast, compiled function to a larger problem?

I should also mention that in my actual workflow, I separate out the data creation from the model formula/design matrix creation, and the compilation problems remain.

2 Likes

I don’t know StatsModels at all, but I see a few standard performance pitfalls.

You have some poorly typed Dicts: dd is a Dict{Any, Any} but should be completely concretely typed. myhints also has Any for values, can this be narrowed?

Splatting is a classic performance trap:

Perhaps reduce(hcat, etc) can help you, or some other technique.

I’m not sure if these make up a significant part of your compile time, but the non-linear increase with size indicates that they might.

1 Like

Most of the compilation time is spent here:

Could this be an issue of StatsModels?

Edit: the type of the terms mf get rather complex…

1 Like

I just realized that you are making a gigantic named tuple, and dragging that through your code.

Tuples (and by extension, named tuples) are heavy for the compiler. You are making huge tuples that are dynamically sized (size not known at compile time.) Tuples should be short (I rarely use more than 3-4 elements, though I guess they could be quite a bit longer) and statically sized.

Can you use a Dict instead?

2 Likes

Rather than this, I think the issue is the call to apply_schema with a very large model formula. Commenting out the last two lines of the example function lead to a compilation of just ~2 seconds with NPREDICTORS=100 when I tried.

1 Like

But those functions depend on gigantic tuples, and with tuples, the whole thing needs to recompile each time you change size.

As I said, I know nothing about StatsModels, but I know that big tuples are bad. Perhaps they also have a knock-on effect for the compilation of apply_schema?

What else can be done?

1 Like

It’s a known issue that StatsModels has high compilation costs currently:
https://github.com/JuliaStats/StatsModels.jl/issues/201

And I’m afraid it’s not been optimized at all for such a large number of variables.

Maybe you’d better compute matrix columns manually, possibly based on the matrix that StatsModels generates from an elementary pattern.

2 Likes

Dang. Thanks for pointing this out.

If you’re feeling adventurous, have a look at https://github.com/JuliaStats/StatsModels.jl/pull/247 and/or https://github.com/JuliaStats/StatsModels.jl/pull/183 (that later one probably only is relevant if you have functionterms actually).

The other thing that’s triggering more specialization than necessary is that we currently track the types of children terms in containers like FormulaTerm, MatrixTerm, etc. as tuples which are included as type parameters on the container. So that really hammers the compiler. on the roadmap is replacing TupleTerm with a MultiTerm which could hold any kind of iterable and may reduce speicalization if you use a vector or Set to hold the children.

2 Likes