Three million linear regressions

Baba_Yara_Fahiz · May 21, 2022, 10:03pm

I am running three million regressions of the form below and I feel like I am leaving a lot of performance on the table.

Can I get some tips to help speed things up?

using InvertedIndices  # provides Not

function test_1(data_0)
    EignVectors = rand(100, 12)
    Factors     = zeros(size(data_0, 1), 12)

    for i in 1:size(Factors, 1)
        Indexer = Not(ismissing.(data_0[i, :]))  # real data contains some missing values
        Factors[i, :] = EignVectors[Indexer, :] \ data_0[i, Indexer]  # all I need are the slope coefficients
    end
    return Factors
end

data_1 = rand(3000000, 100)
@time test_1(data_1);

Oscar_Smith · May 21, 2022, 10:46pm

@views is your friend.

stevengj · May 21, 2022, 10:49pm

Did you read the performance tips in the Julia manual?

Try using views to avoid making copies with slices. Replace the Not with a broadcasted negation, .!ismissing.(...), since ismissing.(...) already allocates an array anyway, and perhaps preallocate the Indexer array. Try transposing your data array so that each regression reads one column: Julia arrays are stored column-major, so this accesses the data in memory order.
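As a rough illustration of these suggestions, here is a minimal sketch. It is not code from this thread: the function name regress_columns and its argument names are hypothetical, and it assumes the data has already been transposed so that each regression occupies one column.

```julia
using LinearAlgebra

# Sketch only: `regress_columns`, `X`, and `Yt` are hypothetical names.
# `X` is the (n_obs × n_factors) regressor matrix; `Yt` is the transposed
# data, one regression per column, possibly containing `missing` entries.
function regress_columns(X::AbstractMatrix, Yt::AbstractMatrix)
    n_obs, n_reg = size(Yt)
    coefs = zeros(size(X, 2), n_reg)   # one column of slopes per regression
    keep = falses(n_obs)               # preallocated mask, reused every iteration
    for j in 1:n_reg
        y = @view Yt[:, j]             # column view: contiguous in memory, no copy
        keep .= .!ismissing.(y)        # broadcasted negation, written in place
        # Logical indexing still copies the selected rows; the element type
        # is converted so `\` sees plain Float64 values.
        coefs[:, j] = X[keep, :] \ Float64.(y[keep])
    end
    return coefs
end

X = rand(100, 12)
Yt = rand(100, 1_000)
coefs = regress_columns(X, Yt)         # 12 × 1000 matrix of slope coefficients
```

The result is stored with one regression per column as well, so the writes are also in memory order. Transpose it at the end if the original row-per-regression layout is needed.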

Unfortunately, having irregular missing values makes things a lot worse — if it weren’t for that, you could (after transposing the data array) replace the entire loop with a single Factors = EignVectors \ data_0 call, which would probably be much faster. Maybe consider sorting your data into chunks that have identical ismissing patterns, so that you can do the \ in chunks.
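The chunking idea might look something like the sketch below. Again, this is an assumption-laden illustration rather than code from the thread: regress_by_pattern is a hypothetical name, and the data is assumed to be transposed already (one regression per column). Columns that share the same missing-value pattern are solved together with a single \ call.

```julia
using LinearAlgebra

# Sketch only: `regress_by_pattern`, `X`, and `Yt` are hypothetical names.
function regress_by_pattern(X::AbstractMatrix, Yt::AbstractMatrix)
    n_reg = size(Yt, 2)
    coefs = zeros(size(X, 2), n_reg)
    # Map each pattern of observed rows to the columns that share it
    groups = Dict{BitVector,Vector{Int}}()
    for j in 1:n_reg
        pattern = .!ismissing.(view(Yt, :, j))
        push!(get!(groups, pattern, Int[]), j)
    end
    for (keep, cols) in groups
        # One least-squares solve per pattern, with many right-hand sides
        coefs[:, cols] = X[keep, :] \ Float64.(Yt[keep, cols])
    end
    return coefs
end

X = rand(100, 12)
Yt = Matrix{Union{Missing,Float64}}(rand(100, 50))
Yt[1, 1:10] .= missing                 # ten columns share one pattern
coefs = regress_by_pattern(X, Yt)      # 12 × 50 matrix of slope coefficients
```

How much this helps depends on the data: if nearly every column has its own pattern, the grouping degenerates to one solve per column and gains little. With no missing values at all, it reduces to the single EignVectors \ data_0 call described above.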

stevengj · May 21, 2022, 10:59pm

In particular, this is about 50x faster on my machine.

Baba_Yara_Fahiz · May 23, 2022, 12:52am

Thanks for all the comments.
I managed to implement the suggestions and saw a significant increase in speed.
