Three million linear regressions

I am running three million regressions of the form below and I feel like I am leaving a lot of performance on the table.

Can I get some tips to help speed things up?

using InvertedIndices  # provides Not

function test_1(data_0)
    EignVectors = rand(100, 12)
    Factors     = zeros(size(data_0, 1), 12)

    for i in 1:size(Factors, 1)
        Indexer = Not(ismissing.(data_0[i, :]))   # real data contains some missing values
        Factors[i, :] = EignVectors[Indexer, :] \ data_0[i, Indexer]   # all I need are the slope coefficients
    end
    return Factors
end

data_1 = rand(3000000, 100)
@time test_1(data_1);

@views is your friend.
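For anyone unfamiliar with it, here is a tiny illustration of what @views changes. The array `A` and the variable names are made up for the example; the point is only that a plain slice copies, while a view shares memory with the parent array.

```julia
A = rand(4, 3)

row_copy = A[1, :]          # plain slicing allocates a new Vector
row_view = @views A[1, :]   # @views turns the slice into a SubArray, no copy

A[1, 1] = -1.0              # modify the parent array afterwards

println(row_copy[1] == -1.0)   # false: the copy does not see the change
println(row_view[1] == -1.0)   # true: the view reads from A directly
```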

2 Likes

Did you read the performance tips?

Try using views to avoid making copies with slices. Replace the Not with a broadcasted negation, .!ismissing.(...), since you are already allocating an array with ismissing. anyway, and perhaps pre-allocate the Indexer array. Try transposing your data array so that each regression reads a column, which is the order the data is stored in memory.
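A minimal sketch of how those suggestions might fit together, assuming the data has already been transposed to 100 × N so that each regression reads one contiguous column. The name `test_views`, the `keep` buffer, and passing `EignVectors` in as an argument are my own choices for illustration, not from the original post.

```julia
using LinearAlgebra

# Hypothetical rewrite of the loop: `data` is 100 × N (one observation per column).
function test_views(EignVectors::AbstractMatrix, data::AbstractMatrix)
    Factors = zeros(size(EignVectors, 2), size(data, 2))   # one column of slopes per regression
    keep = Vector{Bool}(undef, size(data, 1))               # preallocated mask, reused each iteration
    @views for j in axes(data, 2)
        keep .= .!ismissing.(data[:, j])                    # fill the mask in place
        y = Float64.(data[keep, j])                         # observed values as a plain Float64 vector
        Factors[:, j] = EignVectors[keep, :] \ y            # least-squares slopes for column j
    end
    return Factors
end

# Small usage example with a few missing entries (sizes are arbitrary).
E = rand(100, 12)
data = Matrix{Union{Missing, Float64}}(rand(100, 5))
data[3, 2] = missing
data[50, 4] = missing
F = test_views(E, data)
println(size(F))
```

The result is 12 × N rather than N × 12, which matches the transposed layout; transpose it back at the end if the original shape is needed.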

Unfortunately, having irregular missing values makes things a lot worse. If it weren’t for that, you could (after transposing the data array) replace the entire loop with a single Factors = EignVectors \ data_0 call, which would probably be much faster. Maybe consider sorting your data into chunks that have identical ismissing patterns, so that you can do the \ once per chunk instead of once per row.
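The chunking idea could look roughly like the sketch below, again assuming a transposed 100 × N data array. The function name `test_chunked` and the Dict-based grouping are assumptions for illustration, and I have not benchmarked it.

```julia
using LinearAlgebra

# Hypothetical helper: group columns by their missing pattern, then solve each group at once.
function test_chunked(EignVectors::AbstractMatrix, data::AbstractMatrix)
    Factors = zeros(size(EignVectors, 2), size(data, 2))
    groups = Dict{BitVector, Vector{Int}}()            # observed-row pattern => column indices
    for j in axes(data, 2)
        pattern = .!ismissing.(view(data, :, j))       # true where the value is present
        push!(get!(() -> Int[], groups, pattern), j)
    end
    for (pattern, cols) in groups
        A = EignVectors[pattern, :]                    # rows observed in this group
        Y = Float64.(data[pattern, cols])              # block with no missing values left
        Factors[:, cols] = A \ Y                       # one least-squares solve per group
    end
    return Factors
end

# Small usage example: columns 2 and 5 share a pattern, column 4 has its own.
E = rand(100, 12)
data = Matrix{Union{Missing, Float64}}(rand(100, 6))
data[3, 2] = missing
data[3, 5] = missing
data[40, 4] = missing
F = test_chunked(E, data)
println(size(F))
```

How much this helps depends on how many distinct missing patterns the real data has. If nearly every row has its own pattern, it ends up doing about as many solves as the plain loop.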

8 Likes

In particular, this is about 50x faster on my machine.

3 Likes

Thanks for all the comments.
I managed to implement the suggestions and saw a significant increase in speed.