Hello.
I’m planning to move from R to Julia and I’m running some tests on how to properly handle large datasets and do simple tasks like regressions or survival analysis.
I’ve done a benchmark with R (microbenchmark) for the regressions.
N <- 3000
x1 <- rep(1:N, N)
x2 <- rep(1:N, each = N)
x3 <- sqrt(1:N^2)
x1x2 <- x1 * x2
gg <- rep(1:5, each = N^2/5)
y <- 1 - 2*x1 + 3*x2 + 0.5*x1x2 + rnorm(N^2) + x3*rnorm(N^2)
dat <- data.frame(y, x1, x2, x1x2, x3, gg)
dat2 <- cbind(1, x1, x2, x1x2, x3, gg)
lm(y ~ x1 + x2 + x1:x2 + x3, data = dat)
coef(.lm.fit(dat2, y))
Rfast::lmfit(dat2, y)$be
Results:
lm 3.72s
.lm.fit 1.66s
Rfast lmfit 0.67s
Now I’ve tried to reproduce something similar in Julia with GLM and FixedEffectModels. This is my first day with Julia, so I probably still have a lot to improve.
using DataFrames, GLM, FixedEffectModels, MixedModels

N = 3000
x1 = repeat(1:N, outer=N)
x2 = repeat(1:N, inner=N)
x3 = sqrt.(1:N^2)
x1x2 = x1 .* x2
gg = repeat(1:5, inner=N^2 ÷ 5)
y = 1 .- 2 .* x1 .+ 3 .* x2 .+ 0.5 .* x1x2 .+ randn(N^2) .+ x3 .* randn(N^2)  # randn to match R's rnorm
data = DataFrame(x1=x1, x2=x2, x3=x3, x1x2=x1x2, y=y, gg=gg)
categorical!(data, :gg)
These are the results after several runs, timed with @time:
fit(LinearModel, @formula(y ~ x1+x2+x3+x1x2), data)
reg(data, @model(y ~ x1 + x2 + x3 + x1x2 ))
GLM.jl 2.4s
FixedEffectModels.jl 2.1s
They are both slightly faster than lm() but much slower than Rfast.
I’m using 4 threads in Julia. I left R at its defaults; I guess it also uses 4 threads for these packages.
How can I make Julia regressions faster?
I will be using them with more complex models.
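For a fairer comparison with Rfast’s lmfit, I also tried bypassing the formula machinery and solving the least-squares problem directly. This is just a sketch using the standard library (LinearAlgebra ships with Julia), with a reduced N so it runs quickly; I’m assuming the Cholesky route is roughly the shortcut Rfast takes internally:

```julia
using LinearAlgebra

# Same design as above, with a smaller N so the sketch runs fast.
N = 100
x1 = repeat(1:N, outer=N)
x2 = repeat(1:N, inner=N)
x3 = sqrt.(1:N^2)
x1x2 = x1 .* x2
X = hcat(ones(N^2), x1, x2, x1x2, x3)
y = 1 .- 2 .* x1 .+ 3 .* x2 .+ 0.5 .* x1x2 .+ randn(N^2) .+ x3 .* randn(N^2)

# `\` on a rectangular matrix uses a QR factorization: accurate and simple.
beta_qr = X \ y

# Cholesky on the normal equations: faster, but squares the condition number.
beta_chol = cholesky(Symmetric(X' * X)) \ (X' * y)
```

Both give essentially the same coefficients here; the normal-equations version is the cheaper one when the design is well conditioned.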
I’m not confident using the package FixedEffectModels because it has an option to declare variables as “fixed effects”, and I don’t understand how those differ from any other regular regressor, whether continuous or categorical. The package isn’t supposed to handle random effects, so I don’t understand what the distinction is for.
I’ve also tried to fit a random effects model with MixedModels.jl
fit(LinearMixedModel, @formula(y ~ x1 + x2 + x3 + x1x2 + (1|gg)), data)
It takes just 2.3 s, almost the same as the GLM model. Maybe that’s because the model is very simple, or because the package is very well optimized.
lme4 can’t handle a dataset this large. On smaller ones lme4 takes about 6 times longer, and mgcv’s bam lies in between.
If I wanted to work with larger datasets or more complex models, none of these solutions works. R has the RevoScaleR package, but it doesn’t do regressions with random effects.
Julia also complains about the memory, and MixedModels fails with “ERROR: LinearAlgebra.RankDeficientException(1)”.
I think with Julia I have to try OnlineStats.jl.
How can I run the former regressions (simple and with random effects) using OnlineStats, JuliaDB, or some other solution? For example, for N = 30000.
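I don’t know the OnlineStats API well yet, so as a fallback I sketched the underlying idea in plain Julia: accumulate the sufficient statistics X'X and X'y chunk by chunk, so the full design matrix never has to fit in memory at once. All the names below (StreamOLS, update!, coefs) are mine, not from any package, and I haven’t verified that OnlineStats works exactly this way internally:

```julia
using LinearAlgebra

# Streaming OLS: fold chunks of (X, y) into X'X and X'y, then solve once.
struct StreamOLS
    XtX::Matrix{Float64}
    Xty::Vector{Float64}
end
StreamOLS(p::Int) = StreamOLS(zeros(p, p), zeros(p))

function update!(s::StreamOLS, X::AbstractMatrix, y::AbstractVector)
    s.XtX .+= X' * X   # p×p, tiny regardless of how many rows stream past
    s.Xty .+= X' * y
    return s
end

coefs(s::StreamOLS) = cholesky(Symmetric(s.XtX)) \ s.Xty

# Toy usage: 10 chunks of 1_000 rows instead of one 10_000-row matrix.
s = StreamOLS(3)
for _ in 1:10
    X = hcat(ones(1000), randn(1000), randn(1000))
    y = X * [1.0, -2.0, 3.0] .+ 0.1 .* randn(1000)
    update!(s, X, y)
end
coefs(s)  # ≈ [1.0, -2.0, 3.0]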