Dear Julia friends, I am benchmarking Julia for linear regression models and trying to optimize that for high-dimensional spaces. Any suggestions on the code below to reduce allocations and speed up the code? Thank you so much

using Random
using GLM
using DataFrames
using BenchmarkTools
function benchmark(dimensions::Int)
Random.seed!(1234)
X = randn(100000, dimensions)
β = randn(dimensions)
y = X * β + 0.1 * randn(100000)
# Avoiding column name generation
colnames = [Symbol("x$i") for i in 1:dimensions]
push!(colnames, :y)
data = DataFrame(hcat(X, y), colnames) # Creating DataFrame in one step to avoid extra allocations
formula = @inbounds Term(:y) ~ sum(Term(Symbol("x$i")) for i in 1:dimensions)
GLM.lm(formula, data)
end
for dims in [10, 100, 1000, 10000]
println("Warming up for dimension $dims")
benchmark(dims) # Warm-up run
println("Benchmarking for dimension $dims")
@btime benchmark($dims) # Using BenchmarkTools.@btime for more accurate timing and allocation measurement
end

I just profiled the code with @profview benchmark(1000) and on my computer, this call spent roughly

50% of the time in lm, which is split pretty evenly between ModelMatrix, ModelFrame and fit

30% of the time in DataFrame (constructing data)

20% of the time in randn (generating X)

One would assume the fit method does the actual work, everything else is just moving stuff around in memory. What is the context of this function? Will you call it several times? Can you reuse X or data?

Have you considered just skipping the dataframe and GLM stuff and just doing fit = X \ y directly? That’s all it’s doing in the end.

(Reducing it to X \ y lets you further optimize things by customizing the method of the linear solve, doing things in-place, etcetera. It may also trim off a bunch of overhead.)

You might want to try FixedEffectModels.jl, which also does linear regression but may be faster.

Also, your benchmark includes creating the DataFrame. You should split the data creation and estimation process into separate pieces to best analyze the performance of the estimation.

fit or X \ y complexity also scale differently from other operations (cubic in dimension, IIRC), and therefore testing benchmark(10_000) should spend most time there and optimizing the rest will be less crucial.
Having said this, for smaller problems it is always nice to reduce overhead.
One would want to minimize copying of the data (as just generating it takes 20% with dimension 1_000). Creating the DataFrame with views to columns might help, by adding copycols=false like so:

data = DataFrame(hcat(X, y), colnames; copycols=false)

and maybe to avoid hcat copying, using some in-place versions:

function benchmark(dimensions::Int)
Random.seed!(1234)
D = Matrix{Float64}(undef, 100_000, dimensions+1)
randn!(@view D[:,1:end-1])
β = randn(dimensions)
@views D[:,end] .= D[:,1:end-1] * β + 0.1 * randn(100000)
# Avoiding column name generation
colnames = [Symbol("x$i") for i in 1:dimensions]
push!(colnames, :y)
data = DataFrame(D, colnames; copycols=false) # Creating DataFrame in one step to avoid extra allocations
...

This discussion was had before, much of it is likely unchanged:

I agree with Peter, if you are fitting models with high dimensional fixed effects look at FixedEffectModels which is designed for this and also has GPU support.

The R world has a nice separation between a small lm.fit() which is fast and the full lm() which does more things. We should think similarly.

Mousum Dutta has been building a QR-decomposition based engine for GLM.jl which is more precise. Here also, we see the tradeoffs between a fast and precise method. It’s good to have both in the implementation and in the function design.