Accelerating linear methods

Dear Julia friends, I am benchmarking linear regression models in Julia and trying to optimize them for high-dimensional problems. Any suggestions on the code below to reduce allocations and speed it up? Thank you so much!

using Random
using GLM
using DataFrames
using BenchmarkTools

function benchmark(dimensions::Int)
    Random.seed!(1234)
    X = randn(100000, dimensions)
    β = randn(dimensions)
    y = X * β + 0.1 * randn(100000)

    # Generate column names explicitly instead of relying on DataFrames' defaults
    colnames = [Symbol("x$i") for i in 1:dimensions]
    push!(colnames, :y)
    data = DataFrame(hcat(X, y), colnames)  # Creating DataFrame in one step to avoid extra allocations

    formula = Term(:y) ~ sum(Term(Symbol("x$i")) for i in 1:dimensions)  # build the formula programmatically

    GLM.lm(formula, data)
end

for dims in [10, 100, 1000, 10000]
    println("Warming up for dimension $dims")
    benchmark(dims)  # Warm-up run

    println("Benchmarking for dimension $dims")
    @btime benchmark($dims)  # Using BenchmarkTools.@btime for more accurate timing and allocation measurement
end

I just profiled the code with @profview benchmark(1000) and on my computer, this call spent roughly

  • 50% of the time in lm, which is split pretty evenly between ModelMatrix, ModelFrame and fit
  • 30% of the time in DataFrame (constructing data)
  • 20% of the time in randn (generating X)

One would assume the fit method does the actual work; everything else is just moving data around in memory. What is the context of this function? Will you call it several times? Can you reuse X or data?

Have you considered skipping the DataFrame and GLM machinery and just doing fit = X \ y directly? That's all it's doing in the end.

(Reducing it to X \ y lets you further optimize things by customizing the method of the linear solve, doing things in-place, etcetera. It may also trim off a bunch of overhead.)
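
For example, something along these lines (a sketch: fit_inplace! is a hypothetical helper, and the normal-equations route below trades some accuracy for speed):

using LinearAlgebra

β̂ = X \ y  # direct least-squares solve (pivoted QR under the hood)

# Hypothetical in-place variant for repeated fits with fixed dimensions:
# solve the normal equations XᵀX β = Xᵀy, reusing preallocated buffers.
function fit_inplace!(β, XtX, Xty, X, y)
    mul!(XtX, X', X)               # XtX .= X'X, no temporary
    mul!(Xty, X', y)               # Xty .= X'y
    F = cholesky!(Symmetric(XtX))  # factorize in place
    ldiv!(β, F, Xty)               # β .= F \ Xty
end

p = size(X, 2)
β, Xty, XtX = zeros(p), zeros(p), zeros(p, p)
fit_inplace!(β, XtX, Xty, X, y)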

lm also calculates other values, like standard errors and t-statistics. It would be cool to have more control over the solver, though.
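
Those can be recomputed by hand if you take the X \ y route; here is a minimal sketch from the textbook OLS formulas (the helper name ols_tstats is made up):

using LinearAlgebra

# Hypothetical helper: OLS coefficients plus t-statistics, no GLM.jl needed.
function ols_tstats(X, y)
    n, p = size(X)
    β = X \ y
    r = y - X * β                                 # residuals
    σ² = dot(r, r) / (n - p)                      # unbiased error-variance estimate
    se = sqrt.(σ² .* diag(inv(Symmetric(X'X))))   # standard errors of β̂
    return β, β ./ se                             # t-statistics for H₀: βᵢ = 0
end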


You might want to try FixedEffectModels.jl, which also does linear regression but may be faster.
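
Untested sketch of the call (reg is FixedEffectModels' fitting function; this reuses `data` and the term-building trick from the original post):

using FixedEffectModels
using StatsModels: term

# Fit the same model with FixedEffectModels.jl:
rhs = sum(term(Symbol("x$i")) for i in 1:dimensions)
model = reg(data, term(:y) ~ rhs)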

Also, your benchmark includes creating the DataFrame. You should split the data creation and estimation process into separate pieces to best analyze the performance of the estimation.
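
For example (make_data is a hypothetical split-out of the data-generation half of benchmark):

data, formula = make_data(dims)   # setup runs once, outside the timed region
@btime GLM.lm($formula, $data)    # interpolated, so only the fit is measured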


Yes, I am actually calling it several times (thousands of times), so any small improvement is critical.

The complexity of fit or X \ y also scales differently from the other operations (QR on an n×p matrix costs O(np²) flops, versus O(np) for generating and copying the data), so benchmark(10_000) should spend most of its time in the solve, and optimizing the rest will be less crucial.
Having said this, for smaller problems it is always nice to reduce overhead.
One would want to minimize copying of the data (just generating it already takes 20% of the time at dimension 1_000). Creating the DataFrame with views of the columns might help, by adding copycols=false like so:

data = DataFrame(hcat(X, y), colnames; copycols=false)

and, to avoid the copy made by hcat, maybe use some in-place versions:

function benchmark(dimensions::Int)
    Random.seed!(1234)
    D = Matrix{Float64}(undef, 100_000, dimensions+1)
    randn!(@view D[:,1:end-1])
    β = randn(dimensions)
    @views D[:,end] .= D[:,1:end-1] * β .+ 0.1 .* randn(100_000)

    # Generate column names explicitly instead of relying on DataFrames' defaults
    colnames = [Symbol("x$i") for i in 1:dimensions]
    push!(colnames, :y)
    data = DataFrame(D, colnames; copycols=false)  # Creating DataFrame in one step to avoid extra allocations
...
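
Going further, the broadcast on the y column still allocates the matrix-vector product and the noise vector; a sketch with mul! avoids both:

using LinearAlgebra: mul!

noise = Vector{Float64}(undef, 100_000)    # reusable buffer; randn! is from Random
@views mul!(D[:, end], D[:, 1:end-1], β)   # D[:,end] .= X * β, no temporary
@views D[:, end] .+= 0.1 .* randn!(noise)  # add scaled noise in place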

Do you always have the same X? Or the same y? Do they at least keep the same dimensions?

This discussion has come up before; much of it is likely still relevant:

I agree with Peter: if you are fitting models with high-dimensional fixed effects, look at FixedEffectModels, which is designed for this and also has GPU support.

The R world has a nice separation between the small, fast lm.fit() and the full lm(), which does more things. We should think similarly.
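
A rough sketch of what that two-layer design could look like in Julia (lm_fit and lm_full are hypothetical names; schema, apply_schema, and modelcols are StatsModels' machinery for building a model matrix):

using StatsModels

lm_fit(X::AbstractMatrix, y::AbstractVector) = X \ y   # lean, fast numeric kernel

function lm_full(f::FormulaTerm, data)                 # full-featured wrapper on top
    sch = apply_schema(f, schema(f, data))             # resolve terms against the data
    y, X = modelcols(sch, data)                        # response and model matrix
    return lm_fit(X, y)                                # std. errors etc. would go here
end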

Mousum Dutta has been building a QR-decomposition-based solver for GLM.jl, which is more precise. Here too we see the tradeoff between a fast method and a precise one; it's good to have both, in the implementation and in the function design.
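
The tradeoff is easy to see outside GLM.jl (a sketch with plain LinearAlgebra, not Mousum's actual implementation):

using LinearAlgebra

X = randn(10_000, 100); y = randn(10_000)

β_qr   = qr(X) \ y                         # QR works on X directly: backward-stable
β_chol = cholesky(Symmetric(X'X)) \ (X'y)  # normal equations: fewer flops, but the
                                           # condition number of X gets squared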

They keep the same dimensions. I have posted more detailed code here: Error when processing data from file for linear regression


That would be great; we would have more control.