Parallel/multithreading version of linear regression

Hi there,
I’m new to Julia.
I implemented regression model with interactions using Julia GLM package:
Reg = lm(@formula(dep_var ~ var1&var2&var3), data, true).
But fitting this formula requires a lot of RAM (> 80 GB). I noticed that the calculations are performed on one core, although my OS (x86_64-pc-linux-gnu) has 8 cpu cores.
Is it possible to implement linear regression using multiprocessing/parallelism approaches?
I suppose, it could also improve the model runtime.
Thanks a lot for any recommendation.

Welcome!

In most optimization problems the cost is associated with the calculation of the objective function and its gradient. Thus, you could implement parallel versions of the function and gradient, and use any solver. To get some help on that you would need to be more explicit in the structure of the function that you want to fit.

Edit: of course in this case it will be a sum of squares of some difference between the data and the model. That can be parallelized, and quite easily.

Edit 2: Likely Flux does that (as suggested by ckneale), and parallelizes the sum of squares to the GPU.

Just a comment: the function that evaluates the sum of squares of a model relative to the
data should not in principle allocate anything. It might be that the definition of the model is suboptimal, nor preallocating auxiliary arrays, using global variables, or has type-instabilities, etc, that are leading to these allocations and, perhaps, to the lack of performance.

I’d start by creating the design matrix manually and then use the linearalgebra library. E.g.

X\depvar

Also take a look at the low level api functions in LinearAlgebra.

If you perform OLS via gradient descent as in use Flux (single linear layer no activation function), you can train this batchwise(<<80gb of RAM). It may not converge to give you the same answer as the analytic answer but it should be close. This will also let you use a GPU for faster training.

Is there a reason why you need 80gb of data for a 3 variable OLS problem? Can’t you use a subset of the data?

Thank you very much for the all answers!
I created code based on the examples from GLM documentation
My code looks like this:

using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("GLM")

using DataFrames
using CSV
using GLM

data = CSV.read("/home/data_for_modelling.csv")
categorical!(data, compress=true)
ols = lm(@formula(Y ~ A&B&C), data, true)
pred = predict(ols)
...
# next I extract different model coefficients (it doesn't take a lot of RAM or time).
# A - is numeric, B and C - categorical (factor) variables.

The largest memory usage is observed when executing lm(@formula(Y ~ A&B&C), data, true) (please, see the screenshot below from Zabbix monitoring that demonstrates the memory allocation during lm execution).

My initial dataset (data_for_modelling.csv) is 330MB (444378 obs and 82 variables).
Number of unique combinations (A&B&C) - about 1.3x10^9

Thanks a lot for all advices!

That is an ABSURD amount of memory given where you start. I think so anyway. Have you tried passing just the columns of data that are relevant for the regression?

subdata = select(data,:Y,:A,:B,:C)

How large is the data set? What happens if you just estimate main effects (Y ~ A + B + C)?

1 Like

yes, I tried to pass only those columns that are used in the regression, but this didn’t improve the situation - memory allocation was the same (about 70-80 GB)

It takes a few seconds and about 3 GB RAM to compute regression Y ~ A + B + C (with only main effects). The initial dataset is 330 MB (the dimension of the dataset is indicated above).

Number of unique combinations (A&B&C) - about 1.3x10^9

I don’t have much experience using GLM or data frames, but, for what it’s worth, the basic problem of running a linear regression with 500000 observations, 3 regressors, and the three interaction terms can be done very quickly, and using a reasonable amount of memory:

function main()
    n = 500000
    x = randn(n,3)
    x = [ones(n) x x[:,1].*x[:,2] x[:,1].*x[:,3] x[:,2].*x[:,3]]
    y = randn(n,1)
    x\y
end
@time main()

the output is

j

julia> include("junk.jl")
  0.076591 seconds (89 allocations: 110.700 MiB, 6.83% gc time)
7×1 Array{Float64,2}:
 -0.00011835995385762996
  0.001479958265456721
 -0.0005402138432470425
  2.8314036830937423e-5
 -0.0014730582987404113
  0.0007577982782133455
  0.0012875642802494675

Given that this problem has a closed form solution, why would one want to use an iterative solver like Flux?

2 Likes

Thank you so much for the reply!
Your example works very fast for me too and allocates only ~300MiB RAM.
But in my initial data, the numeric variable is only A (198932 unique values), while B is a categorical variable with 412 levels, C is also a categorical variable with 16 levels.
Could handling categorical variables be so problematic?

You seem to be turning the whole dataset into categorical variables-- which leads me to think that you would like to run a regression not with three regressors, but with fixed effects. GLM isn’t made for that, and thus it’s not surprising that it does not perform well. What is the model that you would like to estimate?

In case you would like to run regressions with large sets of categorical variables, you should have a look at
https://github.com/FixedEffects/FixedEffectModels.jl
This package supports parallel de-meaning, if that’s what you’re after.

In case you’re after Bai (2009) interactive fixed effect models, check out
https://github.com/FixedEffects/InteractiveFixedEffectModels.jl

3 Likes

Ah, now I understand, from this and jmboehm’s answer. If GLM creates a regressor for each unique combination of the categorical effects, then you have an enormous number of regressors, larger than the sample size. In that case, there is no unique solution to the regression problem. Using Flux, you could in principle train to some coefficients such that the sum of squares is zero, but that solution would not be unique. At any rate, that does explain why the memory usage is so large.

1 Like

My dataset has 444378 observations, but number of unique combination of the categorical effects (B&C) is 6608.

Thanks a lot, I’ll try these packages!

Slightly off topic for the specific question, but on topic for the actual thread topic - shouldn’t a plain vanilla lm() call in GLM be multithreaded though, as it would at its core perform some matrix algebra using OpenBLAS, which I understood to be multithreaded by default?

So I would have expected something simple like

using DataFrames, GLM

lm(@formula(x1 ~ x2 + x3 + x4 + x5), DataFrame(rand(1_000_000, 5)))

to run multithreaded, but it seems to only max out one core?

1 Like

It probable depends on the size of the matrices and the types of operations you’re doing right? That’s a pretty skinny matrix so I’m not sure there’s much benefit to multithreading but that’s way above my pay grade…

The reason you’re running into RAM issues is that StatsModels (which GLM uses under the hood) is constructing a full dense 444378x6608 model matrix, which is mostly zeros because of the large number of categorical levels. In principle StatsModels supports sparse model matrices but the support for that isn’t very good and I don’t think there’s a way to avoid constructing the full dense matrix to begin with…

1 Like

@dave.f.kleinschmidt, thanks a lot for your answer!
I fully agree with you. That’s why I’ll try to sparse one hot encoded features before fitting the linear model.
However, the FixedEffectModels.jl package also showed excellent results, but I have some difficulties in extracting coefficients and other model parameters :confused: