GLM is slow on large datasets. Using OnlineStats for regressions? MixedModels?

Hello.

I’m planning to move from R to Julia and am running some tests on how to properly deal with large datasets and do simple tasks like regressions or survival analysis.

I’ve done a benchmark with R (microbenchmark) for the regressions.

N <- 3000
x1 <- rep(1:N, N)
x2 <- rep(1:N, each = N)
x3 <- sqrt(rep(1:N^2))
x1x2 <- x1*x2
gg <- rep(1:5, each = N^2/5)
y <- 1 - 2*x1 + 3*x2 + 0.5*x1x2 + rnorm(N^2) + x3*rnorm(N^2)
dat <- data.frame(y, x1, x2, x1x2, x3, gg)
dat2 <- cbind(1, x1, x2, x1x2, x3, gg)

lm(y ~ x1 + x2 + x1:x2 + x3, data = dat)
coef(.lm.fit(dat2, y))
lmfit(dat2, y)$be

Results:

lm 3.72s
.lm.fit 1.66s
Rfast lmfit 0.67s

Now I’ve tried to reproduce something similar in Julia with GLM and FixedEffectModels. This is my first day, so maybe I need to improve many things.

using DataFrames

N = 3000
x1 = repeat(1:N, outer=N)
x2 = repeat(1:N, inner=N)
x3 = sqrt.(repeat(1:N^2))
x1x2 = x1 .* x2
gg = repeat(1:5, inner=trunc(Int, N^2/5))
y = 1 .- 2x1 + 3x2 + 0.5*x1x2 + rand(N^2) + x3 .* rand(N^2)
data = DataFrame(x1=x1, x2=x2, x3=x3, x1x2=x1x2, y=y, gg=gg)
categorical!(data, :gg)

These are the results after several runs (timed with @time):

using GLM, FixedEffectModels

fit(LinearModel, @formula(y ~ x1 + x2 + x3 + x1x2), data)
reg(data, @model(y ~ x1 + x2 + x3 + x1x2))

GLM.jl 2.4s
FixedEffectModels.jl 2.1s

They are both slightly faster than lm() but much slower than Rfast.
I’m using 4 threads in Julia and left R at its defaults; I guess it also uses 4 threads for these packages.
How can I make Julia regressions faster?
I will be using them with more complex models.
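
A more careful way to time this would probably be BenchmarkTools’ @btime instead of eyeballing repeated @time runs; roughly:

using BenchmarkTools

# @btime discards compilation and reports the best of many samples;
# interpolating data with $ keeps the global-variable lookup out of the timing.
@btime fit(LinearModel, @formula(y ~ x1 + x2 + x3 + x1x2), $data);
@btime reg($data, @model(y ~ x1 + x2 + x3 + x1x2));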

I’m not confident using the FixedEffectModels package because it has an option to specify “fixed effects”, but I don’t understand the difference between those effects and any other regular variable, whether continuous or categorical. It isn’t supposed to handle random effects, so I don’t understand the distinction.

I’ve also tried to fit a random effects model with MixedModels.jl

using MixedModels
fit(LinearMixedModel, @formula(y ~ x1 + x2 + x3 + x1x2 + (1|gg)), data)

It takes just 2.3s, almost the same as the GLM model. Maybe because it’s very simple, or because the package is very well optimized.
lme4 can’t handle this large dataset. With smaller ones lme4 takes 6 times longer, and mgcv’s bam lies in between.

If I wanted to work with larger datasets or more complex models, none of these solutions would work. R has the RevoScaleR package, but it doesn’t do regressions with random effects.
Julia also complains about memory, and MixedModels throws “ERROR: LinearAlgebra.RankDeficientException(1)”.

I think that in Julia I have to try OnlineStats.jl.
How can I run the previous regressions (simple and with random effects) using OnlineStats, JuliaDB, or some other solution?
For example, for N = 30000.
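
For the plain regression I imagine something along these lines, adapted from the OnlineStats docs (untested, and I don’t see how the random-effects part would fit in):

using OnlineStats

# Feed the observations one at a time into an online linear regression
# instead of factorizing the whole problem at once.
X = hcat(ones(length(y)), x1, x2, x1x2, x3)
o = fit!(LinReg(), zip(eachrow(X), y))
coef(o)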

Have you tried XGBoost?

I didn’t know about it; I’m going to try it.
Do you know how to easily translate my example to XGBoost?

I’m also trying Alistair.jl but it doesn’t compile with Julia 1.0.

I’ve also seen LsqFit.jl but I can’t figure out how to use it with my data.

Do XGBoost and GLM have anything in common other than both doing regression? GLM fits linear models, whereas XGBoost fits boosted regression trees?

The Rfast package is faster because it accepts a matrix instead of a data frame and does not compute any standard errors. Just use \ in Julia to do this:

N=3000
x1 = repeat(1:N, outer=N)
x2 = repeat(1:N, inner=N)
x3 = sqrt.(repeat(1:N^2))
x1x2 = x1 .* x2
gg = repeat(1:5,inner=trunc(Int,N^2/5))
y = 1 .- 2x1 + 3x2 + 0.5*x1x2 + rand(N^2) + x3.*rand(N^2)
dat2 = hcat(ones(length(x1)), x1, x2, x1x2, x3, gg)
@time dat2 \ y
#> 0.773727 seconds (84 allocations: 480.724 MiB, 2.50% gc time)

Nice.
I guess you mean:

hcat(ones(length(x1)), x1, x2, x1x2, x3)

without the gg, because I didn’t include it in the simple models without random effects.
On my computer your dat2 \ y now takes 1.4s, faster than GLM but still much slower than Rfast. I expected much more speed from Julia.
The good news is that it also uses much less memory (1/4).
And to be fair, if we wanted to run this many times in a loop, we should also compare the time needed to create the hcat(ones(…)) matrix with the time needed to create the DataFrame.

How can I further speed up your division method?
Maybe by using a different kind of structure or container for the data?
Or a different algorithm, or special macros?

Maybe R’s speed also has to do with it using MKL.
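
One thing I can check is which BLAS Julia is linked against and how many threads it uses, since most of the time in \ is spent inside BLAS/LAPACK:

using LinearAlgebra

BLAS.vendor()            # :openblas by default; an R built against MKL may behave differently
BLAS.set_num_threads(4)  # e.g. match the number of physical cores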

I’ve found a much faster solution for the regression with fixed effects:

using MultivariateStats
llsq(dat2, y; bias=false)

It takes just 0.21 seconds, though it doesn’t give you any information about variances, p-values…

The documentation isn’t clear, but I’ve found that it accepts the same matrix structure: the first column is ones and the others are the input variables.

I hope Alistair.jl gets updated so we can try it on Julia 1.0.

All these benchmarks are about matrix division/multiplication, etc. Julia is not better than MATLAB or Python for this, since all these languages call BLAS.

Btw, if you look at the source code of llsq, in the background it basically does

cholesky!(Symmetric(dat2' * dat2)) \ (dat2' * y)

that is, it solves the least squares problem using a Cholesky decomposition of the matrix dat2' * dat2.
When you do

dat2 \ y

it solves the least squares problem using a QR decomposition of the matrix dat2. The QR method is usually slower but more precise (especially when the system is highly collinear). That is why it is the default; in most cases, though, the Cholesky method works well.
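
A minimal illustration of the two routes on a toy, well-conditioned problem (not the benchmark data):

using LinearAlgebra

X = hcat(ones(10_000), randn(10_000, 3))
y = X * [1.0, -2.0, 3.0, 0.5] .+ randn(10_000)

b_qr   = X \ y                                    # least squares via QR
b_chol = cholesky!(Symmetric(X' * X)) \ (X' * y)  # normal equations via Cholesky, as llsq does

maximum(abs.(b_qr .- b_chol))                     # tiny difference when X is well conditioned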

I’ve seen the code of Alistair.jl; it doesn’t work on Julia 1.0, but it’s supposed to be fast.

and they are using:
(X' * X) \ (X' * Y)

What’s the advantage of this method over X \ Y?

Writing code that does not work but is supposed to be fast must be a niche industry.

None. But as compensation, it comes with plenty of disadvantages (numerical stability).

I’ve been trying it, and it seems that

(dat2' * dat2) \ (dat2' * y)
gives the same coefficients as
lm(@formula(y ~ x1 + x2 + x3 + x1x2), data)

So the method should be right, or at least GLM uses the same method.

On the other hand,
dat2 \ y
gives slightly different coefficients and is much slower.

(dat2' * dat2) \ (dat2' * y) is faster because the products produce a small matrix, and solving that small system is much cheaper than factorizing the large dat2 matrix.

So which one should I trust?

(X'X)\(X'y) is the canonical way that OLS is estimated: minimizing ||y - X*b||^2 gives the normal equations X'X*b = X'y.

But what are you trying to do? In my experience, linear regression without standard errors is not very useful.

Mathematically, of course it does. To see the error, you would need an ill-conditioned matrix.
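
For instance, with a contrived, nearly collinear X (a hypothetical example, not the thread’s data):

using LinearAlgebra

n = 1_000
t = collect(range(0, stop=1, length=n))
X = hcat(ones(n), t, t .+ 1e-7 .* randn(n))   # third column nearly equal to the second
y = X * [1.0, 2.0, 3.0]                       # exact, noise-free response

cond(X), cond(X' * X)      # the second is roughly the square of the first
X \ y                      # QR still recovers the coefficients well
(X' * X) \ (X' * y)        # the normal equations lose far more digits here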

Any reasonable text on numerical linear algebra?

You can pass the model matrix directly to fit instead of using a Formula and DataFrame. That will save you the costly computation of the model matrix. I.e.

julia> @time ft = fit(LinearModel, @formula(y ~ x1+x2+x3+x1x2), data);
  1.254742 seconds (540 allocations: 1.745 GiB, 37.68% gc time)

julia> @time fit(LinearModel, ft.mm.m, data.y);
  0.551531 seconds (26 allocations: 823.976 MiB, 41.03% gc time)

We are already using the Cholesky.

How do you create that ft.mm.m without first running the first line?
I mean, if I want to use your second syntax but also need the first one, I’ll end up spending more time. If you already have the first, why do you need the second?

Do you mean GLM.jl always uses Cholesky, or that the latter syntax with the model matrix uses Cholesky?
@matthieu mentioned QR is more stable and precise.

Matrix(df) will work, or you can dig into the unexported functions in StatsModels that are used to construct a model matrix.
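
For example, a sketch with GLM and DataFrames already loaded (note that no intercept column is added automatically when you bypass the formula interface, so prepend the ones yourself):

X = hcat(ones(size(data, 1)), Matrix(data[:, [:x1, :x2, :x3, :x1x2]]))
fit(LinearModel, X, data.y)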

Does it need to have the same structure as the dataframe, or the same as dat2 = hcat(ones(length(x1)), x1, x2, x1x2, x3)?

I’m not a numerical analyst, but won’t the condition number of X' * X be the square of that of X? Theoretically, (X' * X) \ (X' * Y) and X \ Y should give the same result, but for ill-conditioned problems the first formulation is horrible in comparison. With the original problem being to find the least-squares solution of X*b = Y, a standard procedure would be to do the QR factorization of X. I assume that X \ Y is doing that.

If you want to test this for yourself: splitting up the QR factorization, X = Q*R = [Q1 Q2]*[R1; R2], with Q1 having r = rank(X) columns, and using Q'*Q = I, leads to two sets of equations, R1*b = Q1' * Y and R2*b = Q2' * Y. The first is trivial to solve, since R1 is upper triangular. For the second, in principle R2 = 0, so one has 0 = Q2'*Y: if the model X*b is a perfect description of the data Y, then Y should lie in the nullspace of Q2', hence Q2'*Y = 0. With an imperfect model (e.g., noisy data), Q2'*Y contains information about the squared error; this equation is ditched.

Of course, in doing the QR factorization, one doesn’t need all of Q and R: it suffices to keep Q1 and R1, which normally are much smaller than Q and R.

For many problems, X is of full rank, and then the algorithm is straightforward. If X has some rank loss, it is necessary to find the rank. This is sometimes done using costly SVD computations. There is a “rank revealing” QR factorization which orders the main diagonal elements of R, but I don’t think that algorithm is available in Julia (?). There is a MATLAB version of it, but apparently the MATLAB implementation is copyrighted.

using LinearAlgebra
X = hcat(ones(100), rand(100, 2))
X = hcat(X, 1.5 * X[:, 2])           # make X rank deficient: the last column is a multiple of column 2
F = qr(X, Val(true))                 # column-pivoted ("rank revealing") QR
good = abs.(diag(F.R)) .> √eps()     # pivots with a non-negligible diagonal entry in R
@views if !all(good)
    X = X[:, good[invperm(F.jpvt)]]  # keep only the independent columns, in their original order
    F = qr(X, Val(true))
end

You can use F.jpvt to get the full rank matrix.
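
For instance (valid at any point where F is the pivoted factorization of the current X):

r = count(abs.(diag(F.R)) .> √eps())   # estimated numerical rank
X[:, sort(F.jpvt[1:r])]                # a full-rank subset of the columns, in original order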
