I am running three million regressions of the form below and I feel like I am leaving a lot of performance on the table.
Can I get some tips to help speed things up?
using InvertedIndices  # provides Not (also re-exported by DataFrames)

function test_1(data_0)
    EignVectors = rand(100, 12)
    Factors = zeros(size(data_0, 1), 12)
    for i in 1:size(Factors, 1)
        Indexer = Not(ismissing.(data_0[i, :])) # real data contains some missing values
        Factors[i, :] = EignVectors[Indexer, :]\data_0[i, Indexer] # all I need are the slope coefficients
    end
    return Factors
end

data_1 = rand(3000000, 100)
@time test_1( data_1 );
Did you read the performance tips?
Try using views to avoid making copies with slices. Replace Not(...) with a broadcast negation, (!).(...), since you are allocating an array anyway with ismissing.(...), and perhaps pre-allocate the Indexer array. Try changing the order of (transposing) your data array so that you access the data in memory order: Julia arrays are column-major, so each observation should be a column.
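To make these tips concrete, here is a minimal sketch of how the loop might look with all of them applied. The function name factors_by_column and the argument names E and data_t are made up for illustration, and it assumes the data has already been transposed so that each observation is a column. It has not been timed against the original.

```julia
using LinearAlgebra

# Hypothetical rewrite of the loop (names are illustrative, not from the thread).
# E      : regressors, e.g. 100 × 12
# data_t : transposed data, e.g. 100 × nobs, so each observation is one column
function factors_by_column(E::AbstractMatrix, data_t::AbstractMatrix)
    nobs = size(data_t, 2)
    F = zeros(size(E, 2), nobs)        # one column of coefficients per observation
    keep = falses(size(data_t, 1))     # mask allocated once and reused
    for j in 1:nobs
        col = view(data_t, :, j)       # contiguous column, no copy
        keep .= .!ismissing.(col)      # update the mask in place
        rhs = Float64.(view(col, keep))            # observed values only
        F[:, j] = view(E, keep, :) \ rhs           # least-squares coefficients
    end
    return F
end

# Small usage example with one missing value.
E = rand(10, 3)
data_t = Matrix{Union{Missing,Float64}}(rand(10, 5))
data_t[4, 2] = missing
F = factors_by_column(E, data_t)
```

Note that the result is stored as coefficients × observations, the transpose of the Factors layout in the question, so that writes also go in memory order.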
Unfortunately, having irregular missing values makes things a lot worse. If it weren’t for that, you could (after transposing the data array) replace the entire loop with a single Factors = EignVectors \ data_0 call, which would probably be much faster. Maybe consider sorting your data into chunks that have identical ismissing patterns, so that you can do the \ in chunks.
In particular, this is about 50x faster on my machine.
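For reference, the chunking idea suggested earlier could be sketched roughly as below. This is not the code behind the timing quoted above; the name factors_by_pattern and the argument names are invented for illustration, the data is again assumed to be transposed (observations as columns), and no benchmark has been run on it. How much it helps will depend on how many distinct missing patterns the real data has.

```julia
using LinearAlgebra

# Hypothetical helper (illustrative name): group observations by their
# missing-value pattern and solve each group with a single `\` call.
function factors_by_pattern(E::AbstractMatrix, data_t::AbstractMatrix)
    nobs = size(data_t, 2)
    F = zeros(size(E, 2), nobs)
    groups = Dict{BitVector,Vector{Int}}()      # pattern => column indices
    for j in 1:nobs
        keep = BitVector(.!ismissing.(view(data_t, :, j)))
        push!(get!(groups, keep, Int[]), j)
    end
    for (keep, cols) in groups
        rhs = Float64.(data_t[keep, cols])      # observed rows for this group
        F[:, cols] = E[keep, :] \ rhs           # one solve, many right-hand sides
    end
    return F
end

# Small usage example: columns 2 and 5 share a pattern, column 3 has its own.
E = rand(10, 3)
data_t = Matrix{Union{Missing,Float64}}(rand(10, 6))
data_t[4, 2] = missing
data_t[4, 5] = missing
data_t[7, 3] = missing
F = factors_by_pattern(E, data_t)
```

With no missing values at all there is a single group, and the function reduces to the one-line E \ data_t solve mentioned above.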
Thanks for all the comments.
I managed to implement the suggestions and saw a significant increase in speed.